SECURITY ADVISORY — QUIC Socket Exhaustion (Host DNS DoS via Network-Triggerable Resource Leak)

Issued by: GRIDNET Emergency Response Team
Issued: 2026-05-26
Status: ACTIVELY MITIGATED — patched build available; operators must redeploy.
Severity: HIGH — host-wide denial-of-service of UDP-dependent services (DNS, NTP, WebRTC), remotely triggerable by any network participant capable of advertising peer addresses to a target node.
Affected: All GRIDNET Core 2.0.0 builds prior to r6923.
Fixed in: r6921 (primary handler), r6923 (audit follow-ups). Both required.


1. Executive summary

A network-triggerable resource leak in the QUIC connection-callback layer was discovered this week on at least one production operator node, where it had already run to host-wide failure. Whether the triggering traffic was attacker-induced or organic peer churn cannot be determined retroactively (see 2.4); either one triggers the same defect. Each failed outbound QUIC handshake permanently leaked one MsQuic-owned IPv6 UDP datapath socket. Sustained at the observed leak rate, a victim node exhausts the entire Windows ephemeral UDP port range (16,384 ports) within approximately 39 hours, after which every UDP-dependent service on the host machine fails simultaneously, including the operating system’s DNS resolver, NTP client, browser DoH fallback, and every other application using outbound UDP. The host becomes externally unreachable, the GRIDNET node loses peer discovery, and the operator typically diagnoses it as a generic “network outage” until the GRIDNET Core process is force-restarted, at which point the leaked sockets are released and the host recovers within seconds.

The leak is induced by the act of attempting outbound QUIC connections to unreachable peers. An attacker who can influence the victim’s dial queue — even indirectly, via DHT/peer-list poisoning, gossip-relayed peer advertisements, or simply by being a peer that goes silent after advertising itself — converts the victim’s own peer-discovery cadence into a cumulative resource-exhaustion vector. No authentication, no handshake completion, and no on-chain transaction is required by the attacker. The cost to the attacker is zero. The cost to the victim is permanent UDP socket consumption until process restart.

Patched binaries are available now. All operators are urged to redeploy immediately.


2. Attack vector

2.1 Pre-conditions

The attacker needs only one capability: the ability to cause the victim’s GRIDNET Core process to attempt an outbound QUIC connection to an IP/port that does not complete the QUIC handshake within 10 seconds (the configured HandshakeIdleTimeoutMs). Achievable through any of:

  • Advertising an unreachable IP via the peer-gossip channel.
  • Advertising a reachable host that does not actually run a GRIDNET QUIC listener on the expected port.
  • Operating a peer that completes a TCP-layer probe (so the victim believes the host is alive) but silently drops UDP packets on the QUIC port.
  • Being a peer that intentionally fails the QUIC TLS handshake by responding with malformed Initial packets or by aborting after the first round-trip.
  • Disconnecting from the network shortly after advertising the address (the victim continues attempting reconnection for hours).

None of these require authenticated participation in the network. None require successful completion of the handshake. None of them are individually anomalous — they are the natural noise of any decentralized peer-to-peer network. The vulnerability is in how the victim handled the resulting handshake timeouts.

2.2 Mechanism

Outbound QUIC connections in GRIDNET Core are initiated through Microsoft’s MsQuic library via MsQuic->ConnectionOpen followed by MsQuic->ConnectionStart. Internally, MsQuic allocates a dedicated UDP datapath socket per connection, bound to an ephemeral port on the IPv6 wildcard address ::. This socket is owned by MsQuic and lives until the application explicitly calls MsQuic->ConnectionClose.

When the handshake fails — which it does on every attempt against an unreachable, malformed, or silent peer — MsQuic delivers a QUIC_CONNECTION_EVENT_SHUTDOWN_INITIATED_BY_TRANSPORT event followed by a QUIC_CONNECTION_EVENT_SHUTDOWN_COMPLETE event. Per the MsQuic API contract documented in msquic.h line 1119 (“Ready for the handle to be closed”), the application is required to call ConnectionClose on receipt of SHUTDOWN_COMPLETE, except in the narrow case where Event->SHUTDOWN_COMPLETE.AppCloseInProgress == true (which signals that the callback is being delivered synchronously from inside the application’s own ConnectionClose call, and re-calling would be a double-close).

The pre-r6921 implementation of CConversation::ConnectionCallback’s SHUTDOWN_COMPLETE branch in conversation.cpp had this guard inverted:

if (appCloseInProgress) {                       // INVERTED: should be !appCloseInProgress
    if (flags.QUICConnectionClosed == false) {
        flags.QUICConnectionClosed = true;
        setFlags(flags);
        MsQuic->ConnectionClose(Connection);
    }
}

This meant ConnectionClose was called only in the case MsQuic explicitly forbade (and never reached in practice, because shutdownConnection() issues ConnectionShutdown, never ConnectionClose), and was skipped in every other case — including every handshake-timeout path. The connection handle, and its associated MsQuic-internal datapath UDP socket, leaked permanently.
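
For contrast, the corrected decision can be sketched as a small pure function. This is a minimal illustration, not the r6921 source: mustCloseOnShutdownComplete, invertedGuardCloses, and leakedHandles are hypothetical names, and plain bool parameters stand in for the MsQuic event field and the conversation flag.

```cpp
// Sketch only: hypothetical helpers, not the GRIDNET Core or MsQuic source.

// Returns true when the SHUTDOWN_COMPLETE handler must call ConnectionClose.
// appCloseInProgress == true means the event is being delivered from inside
// the application's own ConnectionClose call, so closing again would be a
// double-close. alreadyClosed mirrors flags.QUICConnectionClosed.
constexpr bool mustCloseOnShutdownComplete(bool appCloseInProgress,
                                           bool alreadyClosed) {
    return !appCloseInProgress && !alreadyClosed;
}

// The pre-r6921 guard, reproduced for comparison: it closes only in the one
// case where closing is forbidden.
constexpr bool invertedGuardCloses(bool appCloseInProgress,
                                   bool alreadyClosed) {
    return appCloseInProgress && !alreadyClosed;
}

// Handles left open after `failedHandshakes` transport-initiated shutdowns
// (AppCloseInProgress == false, handle not yet closed), under either guard.
constexpr long leakedHandles(long failedHandshakes, bool useInvertedGuard) {
    return (useInvertedGuard ? invertedGuardCloses(false, false)
                             : mustCloseOnShutdownComplete(false, false))
               ? 0
               : failedHandshakes;
}
```

Under these assumptions, leakedHandles(1000, true) evaluates to 1000 and leakedHandles(1000, false) to 0: the inverted guard leaks one handle per failed handshake, and the corrected guard leaks none.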

2.3 Exploit economics

Each leaked socket consumes one ephemeral port from the Windows dynamic-port range (default: 49152–65535, total 16,384). Observed leak rate on a victim node: ~7 sockets per minute under organic peer-discovery traffic alone.

  • Time to ephemeral port exhaustion: 16,384 ports / 7 sockets per minute ≈ 2,340 minutes ≈ 39 hours of uptime.
  • Time to system-wide DNS failure: same.
  • Attacker amplification: by advertising additional unreachable peers into the victim’s gossip mesh, an attacker can raise the leak rate up to a ceiling set by the victim’s parallel-dial limit (≥10 concurrent dials) divided by the handshake timeout (10 seconds). At 10 concurrent dials, each slot leaks one socket every 10 seconds, so the theoretical maximum induced rate is ~60 sockets/minute, reducing exhaustion time to under 5 hours (16,384 / 60 ≈ 273 minutes).
  • After exhaustion: every outbound UDP bind() on the host returns WSAENOBUFS. The Windows DNS client uses ephemeral UDP, so Resolve-DnsName, nslookup, getaddrinfo(), browser DNS — all fail simultaneously across every process on the host. The operator’s diagnostic tools (which themselves need DNS) also fail. The fault presents as a total network outage but is in fact a single misbehaving process starving the host of UDP ports.
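
The figures in this section can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the inputs are the numbers quoted above, and the induced-rate bound assumes that every dial slot leaks exactly one socket each time its handshake times out.

```cpp
// Sketch only: hypothetical helper names; inputs are the advisory's figures.

// Size of an inclusive port range, e.g. 49152..65535.
constexpr int portRangeSize(int firstPort, int lastPort) {
    return lastPort - firstPort + 1;
}

// Hours until `ports` are consumed at a constant leak rate (sockets/minute).
constexpr double hoursToExhaustion(int ports, double socketsPerMinute) {
    return ports / socketsPerMinute / 60.0;
}

// Upper bound on the induced leak rate, assuming every dial slot leaks one
// socket each time its handshake times out.
constexpr double maxInducedRatePerMinute(int concurrentDials,
                                         double handshakeTimeoutSeconds) {
    return concurrentDials * (60.0 / handshakeTimeoutSeconds);
}
```

With the default range, portRangeSize(49152, 65535) is 16384. hoursToExhaustion(16384, 7.0) comes to roughly 39 hours, maxInducedRatePerMinute(10, 10.0) to 60 sockets per minute, and hoursToExhaustion(16384, 60.0) to roughly 4.55 hours.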

2.4 Observed exploitation

A live operator node was confirmed in this exact state on 2026-05-24/26: PID 7808 holding 16,381 UDP endpoints (16,210 on ::), the entire ephemeral port range consumed, host DNS failing with WSAENOBUFS. Process inspection via CDB confirmed that every entry in CQUICConversationsServer::mConversations had mQUICConnectionHandle = 0 (the application had given up on the conversation), mCeaseCommunication = true (the conversation was logically dead), and mFlags.QUICConnectionClosed = false (ConnectionClose had never been called) — the leak signature in full. Restarting the process released the sockets and restored DNS within seconds.

It is not possible to determine retroactively whether the precipitating handshake-failure traffic was organic peer churn or attacker-induced; both produce the same observable state. The attack surface is intrinsic to the bug — any peer or peer-advertiser, malicious or not, can trigger it.


3. Remediation

3.1 Code fixes (already committed and built)

r6921 — invert the AppCloseInProgress guard in CConversation::ConnectionCallback’s QUIC_CONNECTION_EVENT_SHUTDOWN_COMPLETE branch (conversation.cpp). The connection handle is now closed on every transport-initiated and peer-initiated shutdown, exactly as the MsQuic contract requires. The per-conversation mConfiguration is also captured-then-nulled under mFieldsGuardian before ConfigurationClose is called on the captured value, eliminating a dangling-pointer window for racing readers.
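
The capture-then-null idiom described above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the committed code: ConfigurationOwner, release, and current are hypothetical names, Handle stands in for the opaque MsQuic handle type, mFieldsGuardian is assumed to behave like a std::mutex, and closeFn stands in for a library call such as ConfigurationClose.

```cpp
#include <mutex>
#include <utility>

// Hypothetical stand-in for an opaque library handle.
using Handle = void*;

class ConfigurationOwner {
public:
    explicit ConfigurationOwner(Handle h) : mConfiguration(h) {}

    // Detach the handle while holding the lock, then close the captured
    // value outside the lock. Readers that take the lock afterwards see
    // nullptr, never a handle that is in the middle of being closed.
    template <typename CloseFn>
    bool release(CloseFn closeFn) {
        Handle captured = nullptr;
        {
            std::lock_guard<std::mutex> lock(mFieldsGuardian);
            captured = std::exchange(mConfiguration, nullptr);
        }
        if (captured == nullptr) {
            return false;  // already released by an earlier call
        }
        closeFn(captured);
        return true;
    }

    // Read the current handle under the same lock.
    Handle current() {
        std::lock_guard<std::mutex> lock(mFieldsGuardian);
        return mConfiguration;
    }

private:
    std::mutex mFieldsGuardian;
    Handle mConfiguration;
};
```

Because the member is swapped to nullptr before the close call runs, a second call to release returns false without invoking closeFn again, so the handle is closed at most once.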

r6923 — three follow-up fixes flagged by a comprehensive state-machine audit of the QUIC handle lifecycle:

  1. Outbound ConnectionOpen failure path (conversation.cpp around line 545): the per-conversation mConfiguration created moments earlier was leaking on early-return. Now released using the same capture-then-close idiom as the SHUTDOWN_COMPLETE branch.
  2. mListenerHandle was a dead field, never written. StartQuicServer now assigns the handle to the member after a successful ListenerStart, so explicit shutdown can reach it.
  3. CQUICConversationsServer::shutdownQUIC() existed but had no callers. Now invoked from stop() after killServerSocket(). It releases the listener handle, the registration handle, and the MsQuicClose API table in the correct order.

The fixes are minimal, surgical, and free of behavior changes outside the leak paths. Total change footprint: two source files, two compile units.

3.2 Operator action required

  1. Pull the patched build. SVN HEAD at r6923 or later. Confirm svn log shows both r6921 and r6923 in the working copy’s history.
  2. Verify the binary. A patched GRIDNET Core.exe has a build timestamp of 2026-05-26 12:16 UTC or later.
  3. Stop the running unpatched node. Stop-Process -Name 'GRIDNET Core' -Force (Windows) or the equivalent on your platform. Stopping is destructive to in-memory state but harmless to on-disk blockchain data. If your host’s DNS has already failed due to this bug, the stop also restores host DNS within seconds — do not wait.
  4. Start the patched build. Standard launch sequence with the working directory set to GRIDNETCore\GRIDNET\ as documented.
  5. Verify the fix. After ~5 minutes of uptime, run:
    (Get-NetUDPEndpoint -OwningProcess (Get-Process 'GRIDNET Core').Id).Count
    
    Expected: under 50. On the same host before the patch this value would have been growing past 5,000 within the first 12 hours. A patched node maintains a roughly flat UDP endpoint count regardless of uptime.

3.3 Operational mitigation if redeployment must be delayed

If for any reason a patched build cannot be deployed immediately, the operational workaround is a scheduled process restart well before the 39-hour exhaustion window:

# Run every 12 hours via Task Scheduler. Adjust paths as needed.
Stop-Process -Name 'GRIDNET Core' -Force
Start-Sleep -Seconds 5
Start-Process -FilePath 'C:\path\to\GRIDNET Core.exe' -WorkingDirectory 'C:\path\to\GRIDNET\'

This is a band-aid that does not eliminate the attack surface — an attacker can still induce DNS failure for any host whose GRIDNET node has been up longer than the attacker’s induced exhaustion window. The only complete remediation is the patched build.

Widening the Windows ephemeral port range (netsh int ipv4 set dynamicport udp start=10000 num=55000) is also a band-aid — it lengthens the exhaustion window by a constant factor but does not close the underlying leak.


4. Detection and forensics

4.1 Live detection

A node currently subject to this bug — whether organically or under active exploitation — will exhibit ALL of the following:

# 1. UDP endpoint count anomaly
(Get-NetUDPEndpoint -OwningProcess (Get-Process 'GRIDNET Core').Id).Count
# Healthy:    < 50
# Suspicious: > 1,000
# Pre-failure: > 10,000
# Failed:     > 15,000

# 2. Bind-address signature
Get-NetUDPEndpoint -OwningProcess (Get-Process 'GRIDNET Core').Id |
    Group-Object LocalAddress | Sort-Object Count -Descending
# Patched: roughly equal distribution across ::, 0.0.0.0, 127.0.0.1
# Bug:     16,000+ entries on `::` with random high ephemeral ports

# 3. Host DNS health (the user-visible end-state)
Resolve-DnsName cloudflare.com -Server 1.1.1.1
# Patched: succeeds in milliseconds
# Failed:  WSAENOBUFS, "An operation on a socket could not be performed
#          because the system lacked sufficient buffer space"
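
For operators who feed the endpoint count into their own monitoring, the thresholds from check 1 can be expressed as a simple classifier. This is a sketch only: the enum and function names are hypothetical, and the Indeterminate band (50 to 1,000) is an assumption, since the thresholds above do not name that range.

```cpp
// Sketch only: hypothetical names; thresholds are those listed in check 1.
enum class UdpEndpointState {
    Healthy,        // < 50
    Indeterminate,  // 50..1,000: not named by the advisory (assumption)
    Suspicious,     // > 1,000
    PreFailure,     // > 10,000
    Failed          // > 15,000
};

// Map a UDP endpoint count for the GRIDNET Core process to a state.
constexpr UdpEndpointState classifyUdpEndpointCount(int count) {
    return count > 15000 ? UdpEndpointState::Failed
         : count > 10000 ? UdpEndpointState::PreFailure
         : count > 1000  ? UdpEndpointState::Suspicious
         : count < 50    ? UdpEndpointState::Healthy
                         : UdpEndpointState::Indeterminate;
}
```

For example, the count of 16,381 measured on the affected node classifies as Failed, and the post-patch count of 5 classifies as Healthy.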

4.2 Post-incident forensics

Attach CDB to the running process and walk:

TestNet BM -> mNetworkManager -> mQUICServer -> mConversations[i]

For each conversation entry, check mFlags.QUICConnectionClosed. If this is false while mCeaseCommunication == true and mQUICConnectionHandle == 0, the conversation was abandoned without ConnectionClose ever having been issued — the bug signature.

A patched node will never produce this state: every entry whose mCeaseCommunication is true will also have QUICConnectionClosed true.
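
The signature and the patched-node invariant can be written down as two predicates. The sketch below is illustrative only: ConversationSnapshot is a hypothetical record of the three fields read in the debugger, not the real CConversation layout, and the function names are assumptions.

```cpp
#include <cstdint>

// Sketch only: a hypothetical snapshot of the three fields read in the
// debugger; this is not the real CConversation layout.
struct ConversationSnapshot {
    std::uintptr_t mQUICConnectionHandle;  // 0: the app dropped the handle
    bool mCeaseCommunication;              // conversation is logically dead
    bool QUICConnectionClosed;             // ConnectionClose was issued
};

// The bug signature: abandoned and dead, but never closed.
constexpr bool isLeakSignature(const ConversationSnapshot& c) {
    return c.mQUICConnectionHandle == 0 && c.mCeaseCommunication &&
           !c.QUICConnectionClosed;
}

// Invariant stated for patched nodes: a dead conversation is always closed.
constexpr bool satisfiesPatchedInvariant(const ConversationSnapshot& c) {
    return !c.mCeaseCommunication || c.QUICConnectionClosed;
}
```

Any entry for which isLeakSignature returns true also fails satisfiesPatchedInvariant, which is why the state cannot appear on a patched node.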


5. Timeline

Time (local)         Event
2026-05-23 ~02:00    Operator restarted GRIDNET Core (uptime clock for the affected incident begins)
2026-05-26 ~10:30    Operator reports host-wide DNS failure; UDP endpoint count measured at 16,381
2026-05-26 ~10:40    Process inspection via CDB; root cause identified as inverted AppCloseInProgress guard
2026-05-26 11:34     Primary fix committed as r6921
2026-05-26 ~11:50    Comprehensive state-machine audit identifies three additional leak paths
2026-05-26 12:17     Follow-up fixes committed as r6923; patched binary built
2026-05-26 12:17     Patched binary deployed; UDP endpoint count drops to 5 and remains flat under organic peer-discovery traffic
2026-05-26 ~13:00    This advisory issued

6. Acknowledgements and credit

This bug was identified, triaged, root-caused, and fixed entirely from a live production node under active failure conditions, without service interruption beyond the single forced restart required to deploy the patched binary. The Emergency Response Team thanks the affected operator for the high-quality bug report and for keeping the failing process alive long enough for forensic analysis. Do not kill a leaking node before its state has been captured — the process memory is the highest-fidelity record of how the bug was actually reached, and a post-mortem dump is sometimes the difference between a one-day fix and a one-month fix.

If you are an operator and you notice the symptoms of this advisory on your own node, contact the Emergency Response Team channel immediately rather than restarting silently. The fix is now in place for this specific bug, but the same general class — application-layer mishandling of asynchronous resource-release callbacks in a P2P stack — recurs across many codebases. If you suspect any similar pattern, we want to know.


7. Lessons internalized

  • Every API contract violation is a leak, regardless of how plausible the inverted code looks. The AppCloseInProgress guard read naturally as “if the app is closing, close the handle,” which is exactly backwards from the documented MsQuic semantics. Inverted-boolean defects of this kind survive code review precisely because both readings are grammatical English; only the API documentation disambiguates them. We will be auditing every async-callback handler in CConversation against its upstream library contract in the coming sprint.
  • A slow leak that takes 39 hours to manifest is indistinguishable from a fast leak that runs for 39 hours. This bug had been present in the codebase across multiple releases without triggering operator complaints, because operators reflexively restart nodes well before the 39-hour exhaustion window during ordinary maintenance. The bug was only exposed by a node that ran undisturbed long enough to hit the limit. Routine operational hygiene was masking a real attack surface.
  • Network-triggerable resource-exhaustion bugs are remote attacks. This bug is not “memory pressure on a sufficiently busy node”; it is the property that any peer can cost the victim a permanent kernel-level resource at zero cost to themselves. This pattern is worth burning into review checklists for the entire networking stack.

— GRIDNET Emergency Response Team