# Whalescale: Anti-Pattern & Risk Document
This document outlines the architectural boundaries of the Whalescale project. It identifies approaches that have been explicitly rejected, potential design pitfalls, and the non-obvious risks inherent in a peer-to-peer VPN with best-effort NAT traversal.
## 1. Rejected Approaches
### 1.1 Centralized Discovery (DHT/Kademlia)
* **Why Rejected:** Whalescale targets a high-trust, known-client environment. DHTs introduce complexity, latency in discovery, and reliance on bootstrap nodes that act as pseudo-gateways. The gossip + anchor model provides discovery with lower complexity for the intended network scale.
### 1.2 Dedicated TURN/STUN Infrastructure
* **Why Rejected:** The core mission is to avoid dependency on dedicated infrastructure. Whalescale does not use STUN servers for address observation (peers observe each other) or TURN servers for relay (anchor nodes relay encrypted packets through existing tunnels). The same functionality emerges from the P2P mesh without requiring separately operated services.
### 1.3 WireGuard Kernel Module as Data Plane
* **Why Rejected:** WireGuard manages peer endpoints autonomously (roaming silently overrides programmatic `wg set`), creating an unresolvable conflict with the control plane. It has no multipath capability, no extensibility for control messages, and binds a single UDP socket per interface. The custom userspace Noise_IK transport provides identical cryptographic security while allowing full control over endpoint management, multipath, and integrated control messaging.
### 1.4 TCP-Based Connectivity
* **Why Rejected:** TCP is unsuitable for P2P hole punching: NATs track TCP connection state strictly, and the simultaneous-open trick required to punch TCP through two NATs is fragile and widely unsupported. UDP is the only viable transport for NAT traversal.
### 1.5 Port Prediction / Port Sweeping
* **Why Rejected:** Port prediction is architecturally defeated by CGNAT. A CGNAT device shares a single public IP across thousands of subscribers; the external port assigned depends on every other subscriber's concurrent activity, making sequential prediction infeasible. Port sweeping (e.g., ±50 ports) covers a negligible fraction of the CGNAT port range (typically 10,000–60,000+) and wastes battery and bandwidth with near-zero success probability. This mechanism has been removed entirely. Symmetric ↔ Symmetric connectivity is handled via anchor relay instead.
### 1.6 Flow-Level Scheduling for Multipath
* **Why Rejected:** Flow-level scheduling (all packets from a 5-tuple go to the same path) avoids reordering but cannot aggregate bandwidth for single flows. The primary use case — a mobile device streaming video over 5G + WiFi — is a single flow that needs more bandwidth than either path alone provides. Flow-level scheduling would make multipath useful only for failover, not aggregation.
### 1.7 Per-Path Reliability / Retransmission
* **Why Rejected:** Adding retransmission at the multipath layer creates double retransmission for inner TCP (inner TCP retransmits, then Whalescale retransmits the same data), wasting bandwidth and increasing latency. It also breaks inner UDP semantics — inner UDP expects unreliable delivery, and adding reliability at the transport layer introduces unpredictable latency spikes. Whalescale is an unreliable VPN transport; retransmission is the inner protocol's responsibility.
### 1.8 Layer 2 Bonding
* **Why Rejected:** Bonding network interfaces at layer 2 (e.g., Linux bonding driver) requires both interfaces to be on the same physical network segment and does not work across heterogeneous paths like 5G + WiFi. It also requires kernel-level configuration and cannot leverage path-specific scheduling intelligence (knowing that one path has lower latency, another has higher bandwidth).
## 2. Bad Design Decisions (What to Avoid)
### 2.1 Treating Anchors as Optional
* **Risk:** The anchor-first topology is the load-bearing wall of the connectivity model. If all cached endpoints for mobile nodes become stale and no anchor is reachable, the network cannot re-converge. Anchors are not just "nice to have" — they are the mechanism by which mobile peers re-enter the network after IP changes.
* **Rule:** The system must treat anchors as a first-class concept, track anchor availability, and warn when only one anchor remains.
### 2.2 Ignoring UPnP/NAT-PMP/PCP
* **Risk:** Relying solely on hole punching leads to low connection success rates. Proactive port mapping converts a cone NAT into an effectively public endpoint — the highest-value NAT traversal mechanism available.
* **Rule:** Every Whalescale node must attempt UPnP/NAT-PMP/PCP port mapping on startup.
### 2.3 Wall-Clock Timestamps in Gossip
* **Risk:** Clock skew between nodes makes wall-clock timestamps unreliable for conflict resolution. Two nodes observing the same peer at different times with skewed clocks will produce conflicting "latest" states.
* **Rule:** Use monotonically increasing sequence numbers (Lamport-style) for ordering gossip updates. Self-attested endpoints always win over third-party observations.
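The ordering rule above can be sketched in a few lines. This is illustrative only — `PeerRecord` and its fields are hypothetical names, not Whalescale's actual types:

```rust
// Gossip conflict resolution by peer-issued sequence number, never by
// wall-clock time. Illustrative sketch; types are hypothetical.

#[derive(Clone, Debug, PartialEq)]
struct PeerRecord {
    endpoint: String, // "ip:port" as last attested for this peer
    seq: u64,         // peer-issued, monotonically increasing
}

/// Keep whichever record carries the higher sequence number; on a tie,
/// keep the record we already have (the update adds no information).
fn merge(current: Option<PeerRecord>, incoming: PeerRecord) -> PeerRecord {
    match current {
        Some(cur) if cur.seq >= incoming.seq => cur,
        _ => incoming,
    }
}

fn main() {
    let old = PeerRecord { endpoint: "203.0.113.7:51820".into(), seq: 41 };
    let new = PeerRecord { endpoint: "198.51.100.9:51820".into(), seq: 42 };
    // The higher sequence number wins regardless of arrival order or
    // local clock skew.
    let merged = merge(Some(old), new.clone());
    assert_eq!(merged, new);
    println!("kept seq {}", merged.seq);
}
```

Because the peer itself issues the sequence number, two skewed observers can never disagree about which update is newer.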
### 2.4 Excessive Gossip Frequency
* **Risk:** Unbounded gossip leads to broadcast storms, consuming bandwidth and CPU on mobile devices.
* **Rule:** Gossip must use bounded fanout (random subset of neighbors, not all), periodic anti-entropy for convergence, and per-peer rate limiting.
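Bounded fanout reduces to picking a uniform random subset of neighbors per round. A minimal sketch, with an inline xorshift PRNG standing in for a real RNG and a hypothetical `FANOUT` constant:

```rust
// Bounded-fanout target selection: gossip to at most FANOUT random
// neighbors per round instead of broadcasting to all of them.

const FANOUT: usize = 3; // hypothetical value, tune per deployment

fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

/// Partial Fisher-Yates shuffle: after k swaps, the first k slots hold a
/// uniform random subset of the neighbor list, with no duplicates.
fn pick_fanout(neighbors: &[&str], seed: u64) -> Vec<String> {
    let mut pool: Vec<String> = neighbors.iter().map(|s| s.to_string()).collect();
    let mut rng = seed.max(1); // xorshift state must be nonzero
    let k = FANOUT.min(pool.len());
    for i in 0..k {
        let j = i + (xorshift(&mut rng) as usize) % (pool.len() - i);
        pool.swap(i, j);
    }
    pool.truncate(k);
    pool
}

fn main() {
    let neighbors = ["node-a", "node-b", "node-c", "node-d", "node-e"];
    let targets = pick_fanout(&neighbors, 0x5eed);
    assert_eq!(targets.len(), FANOUT);
    println!("gossiping to {:?}", targets);
}
```

Rate limiting and anti-entropy sit on top of this selection; the fanout bound alone is what keeps per-round cost O(FANOUT) instead of O(N).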
### 2.5 Fighting WireGuard's Endpoint Management
* **Risk:** If WireGuard is used, its autonomous roaming will silently override programmatic `wg set` calls, causing the control plane's view of peer endpoints to diverge from reality.
* **Rule:** This is why Whalescale owns its data plane. If WireGuard compatibility mode is used (single-path, WireGuard wire format), endpoint management must still be driven by Whalescale with WireGuard's roaming suppressed or monitored.
### 2.6 Implementing Crypto from Scratch
* **Risk:** Rolling custom crypto is a security catastrophe. A hand-written Noise_IK handshake or transport cipher invites subtle vulnerabilities (nonce reuse, timing side channels, key-confusion bugs) that established, audited libraries have already been hardened against.
* **Rule:** Implement the Noise_IK handshake and transport encryption with `snow` or an equivalent established, audited Rust Noise library. Only the transport, scheduling, and multipath logic should be original code.
### 2.7 Ignoring LAN-Local Communication
* **Risk:** Two nodes on the same LAN that communicate through NAT hairpinning suffer unnecessary latency and dependency on the router's hairpin implementation (which is often broken). They should discover each other and communicate directly on the LAN.
* **Rule:** Implement LAN-local discovery (mDNS or broadcast) and bypass NAT for same-network peers.
### 2.8 Adding Per-Path Congestion Control Without Understanding Double-CC
* **Risk:** The inner traffic (especially TCP) already has its own congestion control. If Whalescale also implements per-path congestion control, the two controllers interfere: congestion on one Whalescale path causes inner TCP to reduce its rate across the entire connection, even if other Whalescale paths are uncongested. This is the "coupled congestion control" problem from MPTCP, but Whalescale cannot solve it the way MPTCP does (by modifying the inner TCP's congestion controller) because the inner TCP is inside the encrypted tunnel.
* **Rule:** The multipath scheduler uses bandwidth estimation and path health monitoring (RTT, loss rate) for scheduling decisions, NOT congestion control. Do not implement send-rate limiting at the Whalescale layer. Let the inner traffic's own congestion control drive the data rate.
### 2.9 Ignoring the Reordering Buffer Under Path Failure
* **Risk:** When a path fails, packets in flight on that path are lost. If the reordering buffer doesn't flush the gaps from the failed path, it blocks indefinitely waiting for packets that will never arrive. This stalls all traffic through the session.
* **Rule:** Path failure must trigger immediate gap flush for the failed path's in-flight packets. The reordering buffer must have a bounded maximum depth and force-skip gaps that exceed the timeout.
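A minimal sketch of the failure-flush behavior, assuming hypothetical types (this is not the real Whalescale buffer, which also has the adaptive timeout and depth bound):

```rust
use std::collections::BTreeMap;

// Reordering buffer that holds out-of-order packets until the sequence
// is contiguous, and force-skips gaps on path failure instead of
// blocking forever on packets that will never arrive.

struct ReorderBuffer {
    next_seq: u64,
    held: BTreeMap<u64, Vec<u8>>,
}

impl ReorderBuffer {
    fn new() -> Self {
        Self { next_seq: 0, held: BTreeMap::new() }
    }

    /// Accept a packet; return everything now deliverable in order.
    fn push(&mut self, seq: u64, pkt: Vec<u8>) -> Vec<Vec<u8>> {
        if seq >= self.next_seq {
            self.held.insert(seq, pkt);
        }
        let mut out = Vec::new();
        while let Some(p) = self.held.remove(&self.next_seq) {
            out.push(p);
            self.next_seq += 1;
        }
        out
    }

    /// On path failure: give up on the gaps, release everything held,
    /// and advance the cursor past the highest buffered sequence number.
    fn flush_gaps(&mut self) -> Vec<Vec<u8>> {
        if let Some((&max_seq, _)) = self.held.iter().next_back() {
            self.next_seq = max_seq + 1;
        }
        std::mem::take(&mut self.held).into_values().collect()
    }
}

fn main() {
    let mut buf = ReorderBuffer::new();
    assert_eq!(buf.push(0, b"a".to_vec()).len(), 1); // in order: delivered
    assert_eq!(buf.push(2, b"c".to_vec()).len(), 0); // gap at seq 1: held
    // The path carrying seq 1 fails: flush rather than stall the session.
    assert_eq!(buf.flush_gaps(), vec![b"c".to_vec()]);
    assert_eq!(buf.push(3, b"d".to_vec()).len(), 1); // stream continues
    println!("ok");
}
```

The key property is that `flush_gaps` both drains held packets and moves `next_seq` forward, so traffic resumes immediately after a failure instead of stalling behind a gap that can never fill.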
### 2.10 Setting VPN MTU to a Single Path's MTU in a Multipath Session
* **Risk:** If the VPN MTU is set to the MTU of the fastest path, and a slower path has a smaller MTU, full-size packets sent on the slower path will be fragmented or dropped. This causes silent throughput degradation and inner TCP retransmissions.
* **Rule:** VPN MTU must be the minimum across all active paths. Recalculate when paths are added or removed.
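The rule is a one-liner. The overhead constant below is illustrative, not Whalescale's actual per-packet header size:

```rust
// Tunnel MTU = minimum path MTU across all active paths, minus the
// per-packet tunnel overhead. Recalculate on every path add/remove.

const TUNNEL_OVERHEAD: u32 = 80; // hypothetical: UDP + Noise + multipath header

fn tunnel_mtu(path_mtus: &[u32]) -> Option<u32> {
    path_mtus.iter().min().map(|m| m - TUNNEL_OVERHEAD)
}

fn main() {
    // 5G path with MTU 1400 plus WiFi path with MTU 1500: the 5G path
    // governs, so full-size packets fit on both.
    let mtu = tunnel_mtu(&[1400, 1500]).unwrap();
    assert_eq!(mtu, 1320);
    println!("tunnel MTU: {mtu}");
    // After the 5G path is removed, the MTU can grow again:
    assert_eq!(tunnel_mtu(&[1500]), Some(1420));
}
```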
### 2.11 Assuming Multipath Always Improves Performance
* **Risk:** With paths that have very different latencies (e.g., 20ms WiFi + 200ms satellite), aggressive packet-level scheduling creates deep reordering that the buffer must absorb. The added reordering delay may reduce inner TCP goodput below what single-path (WiFi only) would achieve. The scheduler's reordering depth constraint should prevent this, but the constraint itself limits bandwidth aggregation.
* **Rule:** Multipath performance must be validated with test benching, not assumed. The scheduler must be willing to deprioritize or skip paths that cause more harm than benefit.
## 3. Non-Obvious Risks & Negative Aspects
### 3.1 The Battery Drain Problem (Mobile Nodes)
* **Challenge:** Recovery Mode probing is radio-intensive. Mobile devices aggressively trying to reconnect will deplete battery.
* **Mitigation:** Exponential backoff (1s → 2s → 4s → ... → 60s cap). Prefer reaching an anchor first (single probe, high probability of success) before attempting other cached endpoints.
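The backoff schedule above (1s doubling to a 60s cap) reduces to a small pure function; the function name is illustrative:

```rust
// Exponential backoff for Recovery Mode probing: 1s, 2s, 4s, ... capped
// at 60s, so a disconnected mobile node settles into one probe per
// minute instead of draining the battery with aggressive retries.

fn backoff_secs(attempt: u32) -> u64 {
    // checked_shl guards against overflow for very large attempt counts.
    let raw = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    raw.min(60)
}

fn main() {
    let schedule: Vec<u64> = (0..8).map(backoff_secs).collect();
    assert_eq!(schedule, vec![1, 2, 4, 8, 16, 32, 60, 60]);
    println!("{:?}", schedule);
}
```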
### 3.2 The Split-Brain Network State
* **Challenge:** Two network segments can become isolated. Nodes in each segment believe the other segment's nodes are dead.
* **Mitigation:** Periodic anti-entropy exchanges (slow timer, e.g., every 5 minutes) with random peers. When segments reconnect, self-attested endpoints with sequence numbers provide unambiguous conflict resolution — the higher sequence number wins regardless of which segment produced it.
### 3.3 Gossip-Based DoS (Malicious Node)
* **Challenge:** A connected but malicious node can flood the network with fake observed-endpoint gossip for legitimate peers, pointing them to blackhole addresses.
* **Mitigation:** Self-attested endpoints are signed by the peer's own Ed25519 key. A peer's own declaration of its address is always authoritative over any third-party observation. Observed endpoints from third parties are advisory hints only, never authoritative.
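The trust rule can be sketched as a simple provenance ordering (signature verification elided; types and names are hypothetical):

```rust
// Endpoint trust rule: a peer's signed self-attestation always beats a
// third-party observation. Observed endpoints are advisory hints only.

#[derive(Clone, Copy, PartialEq, PartialOrd, Debug)]
enum Provenance {
    Observed,     // third-party gossip: advisory, never authoritative
    SelfAttested, // signed by the peer's own Ed25519 key (verified upstream)
}

#[derive(Clone, Debug)]
struct Endpoint {
    addr: String,
    provenance: Provenance,
}

/// Self-attested beats observed; an observed endpoint can never displace
/// a self-attested one, so blackhole gossip cannot override the victim's
/// own declaration.
fn prefer(current: Endpoint, incoming: Endpoint) -> Endpoint {
    if incoming.provenance >= current.provenance {
        incoming
    } else {
        current
    }
}

fn main() {
    let attested = Endpoint {
        addr: "203.0.113.7:51820".into(),
        provenance: Provenance::SelfAttested,
    };
    let malicious = Endpoint {
        addr: "192.0.2.66:40000".into(), // attacker-supplied blackhole
        provenance: Provenance::Observed,
    };
    let winner = prefer(attested.clone(), malicious);
    assert_eq!(winner.addr, attested.addr);
    println!("kept {}", winner.addr);
}
```

In the real system this comparison would run only after the Ed25519 signature on a self-attestation has been verified; an unverifiable "self-attestation" must be discarded, not downgraded.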
### 3.4 Single Anchor as Single Point of Failure
* **Challenge:** If the network has only one anchor and it goes offline, mobile nodes lose their reconnection mechanism. The gossip protocol cannot help because mobile nodes behind symmetric NAT cannot be reached by other mobile nodes.
* **Mitigation:** Recommend at least two anchors on different ISPs. Detect and warn when only one anchor remains. Dual-anchor mutual keepalive keeps both anchors' addresses current.
### 3.5 Permanent Partition of Mobile Nodes
* **Challenge:** A mobile node that loses all cached endpoints (e.g., all peers have rotated IPs simultaneously) cannot re-enter the network without out-of-band resynchronization.
* **Mitigation:** This is an accepted limitation, not a bug. The system should clearly communicate this state to the user and require manual re-bootstrap. Mitigate probability by maintaining connections to multiple anchors on different networks.
### 3.6 Userspace Performance Ceiling
* **Challenge:** Userspace crypto and transport top out around 1–2 Gbps on modern hardware, compared to ~4 Gbps for WireGuard in the kernel.
* **Assessment:** Acceptable for the intended use case (mobile, home, small office networks). If performance becomes critical, a kernel bypass path (e.g., AF_XDP) could be added later without changing the architecture.
### 3.7 Relay Amplification
* **Challenge:** Anchor relay for symmetric ↔ symmetric pairs means the anchor's bandwidth is consumed forwarding traffic between two nodes that cannot connect directly. If many mobile pairs rely on the same anchor, the anchor becomes a bottleneck.
* **Mitigation:** The relay is encrypted packet forwarding, not decryption/re-encryption, so CPU cost is minimal. Bandwidth cost is real. Distribute relay load across multiple anchors. Long-term, encourage IPv6 adoption to eliminate the relay need.
### 3.8 Bootstrap Protocol Exposure
* **Challenge:** Pre-session bootstrap messages (PATH_PROBE, PATH_PROBE_REPLY) are unencrypted. An observer can learn node IP addresses and who is attempting to connect to whom.
* **Assessment:** This is metadata only — no user data or session keys are exposed. Full authentication and encryption begin with the Noise_IK handshake immediately after bootstrap. This tradeoff is inherent to any system that must establish connectivity before authenticating.
### 3.9 Inner TCP Fast Retransmit from Reordering
* **Challenge:** The multipath reordering buffer may not release packets before the inner TCP's loss detection triggers (3 duplicate ACKs → fast retransmit). This causes unnecessary retransmissions that waste bandwidth on all paths. With MAX_REORDERING_DEPTH = 128, Linux auto-tunes `tcp_reordering` to tolerate this, but Windows and macOS may not.
* **Mitigation:** The reordering buffer's adaptive timeout (based on measured path latency spread) is the primary defense — if it delivers packets fast enough, inner TCP never sees reordering. The sender-side reordering depth constraint prevents fast paths from getting too far ahead. Test benching will quantify the actual false retransmit rate on different inner TCP stacks.
### 3.10 Reordering Buffer Memory Under High Throughput
* **Challenge:** At high packet rates with paths that have large RTT differences, the reordering buffer can hold many packets simultaneously. At 100,000 packets/sec with 128-packet depth and 1500 bytes/packet: ~192KB per session. For 50 concurrent peer sessions: ~9.6MB. This is acceptable, but pathological cases (many sessions, all with high throughput) should be monitored.
* **Mitigation:** The MAX_REORDERING_DEPTH = 128 provides a hard memory bound per session. The force-skip mechanism ensures the buffer cannot grow beyond this limit.
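The estimate above is worth making explicit, using the constants stated in the text:

```rust
// Worst-case reordering-buffer memory, from the constants in the text:
// 128-packet depth (MAX_REORDERING_DEPTH), 1500-byte packets, 50 sessions.

const DEPTH: u64 = 128;
const PKT_BYTES: u64 = 1500;

/// Hard per-session memory bound imposed by the depth limit.
fn per_session_bytes() -> u64 {
    DEPTH * PKT_BYTES
}

fn main() {
    let per_session = per_session_bytes(); // 192_000 bytes ≈ 192 KB
    let total = per_session * 50;          // 9_600_000 bytes ≈ 9.6 MB
    assert_eq!(per_session, 192_000);
    assert_eq!(total, 9_600_000);
    println!("{} KB/session, {} MB for 50 sessions",
             per_session / 1000, total / 1_000_000);
}
```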
### 3.11 Multipath Scheduler Oscillation
* **Challenge:** If bandwidth estimates fluctuate rapidly (e.g., due to bursty cross-traffic on a shared WiFi link), path weights may oscillate, causing the scheduler to repeatedly shift traffic between paths. This creates reordering patterns that are harder for the buffer to absorb than steady-state scheduling.
* **Mitigation:** Use a rolling 1-second window for bandwidth estimation (not instantaneous). Apply hysteresis to weight changes — only change a path's weight when the new estimate differs by more than 20% from the current weight.
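The hysteresis rule reduces to a dead-band comparison. A minimal sketch (names are illustrative; the real scheduler feeds this from the rolling 1-second bandwidth window):

```rust
// Weight hysteresis: adopt a fresh bandwidth estimate as the path weight
// only when it moves more than 20% away from the current weight, so
// bursty cross-traffic cannot make the scheduler oscillate.

const HYSTERESIS: f64 = 0.20;

/// Returns the weight to use after seeing a fresh bandwidth estimate.
fn apply_hysteresis(current_weight: f64, new_estimate: f64) -> f64 {
    let relative_change = (new_estimate - current_weight).abs() / current_weight;
    if relative_change > HYSTERESIS {
        new_estimate    // real shift: follow the estimate
    } else {
        current_weight  // inside the dead band: hold steady
    }
}

fn main() {
    // A 10% wobble in the estimate does not move the weight...
    assert_eq!(apply_hysteresis(100.0, 110.0), 100.0);
    // ...but a 50% jump does.
    assert_eq!(apply_hysteresis(100.0, 150.0), 150.0);
    println!("ok");
}
```

The dead band trades a little responsiveness for stability: small estimate noise is absorbed, while genuine capacity changes still move traffic promptly.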