While running tasks on a Windows host one day, wda requests started failing consistently. The only meaningful error in the logs was `Only one usage of each socket address (protocol/network address/port) is normally permitted`. The host environment had not been changed in the preceding period, and this was the first time the problem had appeared; the same environment had never shown it before.

Running the same commands on other hosts and on local development machines could not reproduce the problem reliably; in particular, a Linux development machine could not reproduce it at all.
The far end of each request is the usbmuxd service, which handles communication with the connected iPhone. Every request is ultimately sent to usbmuxd for multiplexing, which allows multiple requests to run concurrently over a single USB link.
usbmuxd behaves differently on Linux and Windows: on Linux it serves a UNIX domain socket, while on Windows it listens on 127.0.0.1:27015. Presumably for security reasons, it binds only to 127.0.0.1:27015 rather than 0.0.0.0:27015.
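The platform split can be captured in a small helper. This is an illustrative sketch: the UNIX socket path `/var/run/usbmuxd` is the usual default on Linux, though a given distribution may differ.

```python
import socket
import sys

def usbmuxd_endpoint(platform=None):
    """Return the usbmuxd endpoint for the given (or current) platform."""
    platform = platform or sys.platform
    if platform.startswith("linux"):
        # On Linux, usbmuxd serves a UNIX domain socket at its default path
        return (socket.AF_UNIX, "/var/run/usbmuxd")
    # On Windows, the service listens on TCP loopback only
    return (socket.AF_INET, ("127.0.0.1", 27015))

def connect_usbmuxd():
    """Open a raw client connection to usbmuxd."""
    family, address = usbmuxd_endpoint()
    sock = socket.socket(family, socket.SOCK_STREAM)
    sock.connect(address)
    return sock
```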
Because usbmuxd on Windows listens only on 127.0.0.1:27015, and requests coming out of a container may not originate from 127.0.0.1, a reverse proxy runs on the host, forwarding 0.0.0.0:23333 to 127.0.0.1:27015. Services whose requests do not originate from 127.0.0.1 can then talk to usbmuxd by connecting to port 23333 on the host. The same reverse proxy exists on Linux, except that there it forwards requests to the corresponding UNIX socket.
Let's start with the first question. Searching for `windows dynamic port range` turns up the following documentation:
> The default dynamic port range for TCP/IP has changed since Windows Vista and in Windows Server 2008: To comply with Internet Assigned Numbers Authority (IANA) recommendations, Microsoft has increased the dynamic client port range for outgoing connections in Windows Vista and Windows Server 2008. The new default start port is 49152, and the new default end port is 65535.

In other words, since Windows Vista / Windows Server 2008 the default dynamic port range runs from 49152 to 65535, 16384 ports in total. The actual configured values can be checked with `netsh int ipv4 show dynamicport tcp`.
Running `netsh int ipv4 show dynamicport tcp` in PowerShell indeed returns a range of 16384 ports.
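The figure checks out against the documented bounds:

```python
# Default dynamic (ephemeral) port range on Windows since Vista / Server 2008
start_port = 49152
end_port = 65535

# Total number of client ports available for outgoing connections
num_ports = end_port - start_port + 1
print(num_ports)  # 16384
```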
To see where ports are going, the following script snapshots established connections via `netstat` and reports sockets that later show up in TIME_WAIT, together with the PID that had owned them:

```python
import subprocess
import time
import signal
import sys

# Dictionary mapping a (local, remote) tcp ip:port pair to the owning pid
tcp_map = {}

def update_tcp_map():
    global tcp_map
    # Execute the command to get established connections
    command = "powershell -Command \"netstat -ano | findstr /V TIME_WAIT\""
    result = subprocess.run(command, capture_output=True, text=True, shell=True)
    if result.returncode == 0:
        lines = result.stdout.strip().split('\n')
        for line in lines:
            parts = line.split()
            if len(parts) >= 5:
                local_address = parts[1]
                remote_address = parts[2]
                pid = parts[4]
                tcp_map[(local_address, remote_address)] = pid

def check_time_wait_sockets():
    global tcp_map
    # Execute the command to get TIME_WAIT sockets (ignoring web traffic)
    command = "powershell -Command \"netstat -ano | findstr TIME_WAIT | findstr /V :443 | findstr /V :80\""
    result = subprocess.run(command, capture_output=True, text=True, shell=True)
    if result.returncode == 0:
        lines = result.stdout.strip().split('\n')
        for line in lines:
            parts = line.split()
            if len(parts) >= 5:
                local_address = parts[1]
                remote_address = parts[2]
                # Check if the TIME_WAIT socket was previously tracked
                if (local_address, remote_address) in tcp_map:
                    print(f"TIME_WAIT socket: {local_address} -> {remote_address}, "
                          f"previously tracked PID: {tcp_map[(local_address, remote_address)]}")
                    # Remove the tracked record
                    del tcp_map[(local_address, remote_address)]
```
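The snippet's `time` and `signal` imports suggest it was driven by a polling loop; a generic driver might look like this (my reconstruction, not the original code):

```python
import signal
import sys
import time

def poll(update, check, interval=1.0, iterations=None):
    """Alternately snapshot connections (update) and report freed TIME_WAIT
    sockets (check), e.g. poll(update_tcp_map, check_time_wait_sockets).

    iterations=None polls until interrupted.
    """
    # Exit cleanly on Ctrl-C
    signal.signal(signal.SIGINT, lambda signum, frame: sys.exit(0))
    count = 0
    while iterations is None or count < iterations:
        update()
        check()
        time.sleep(interval)
        count += 1
```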
The Linux counterpart differs only in the `netstat` invocation and column layout:

```python
import subprocess
import time
import signal
import sys

# Dictionary mapping a (local, remote) tcp ip:port pair to the owning pid
tcp_map = {}

def update_tcp_map():
    global tcp_map
    # Execute the command to get established connections
    command = "netstat -anolp | grep -v TIME_WAIT"
    result = subprocess.run(command, capture_output=True, text=True, shell=True)
    if result.returncode == 0:
        lines = result.stdout.strip().split('\n')
        for line in lines:
            parts = line.split()
            if len(parts) >= 7:
                local_address = parts[3]
                remote_address = parts[4]
                pid = parts[6]
                tcp_map[(local_address, remote_address)] = pid

def check_time_wait_sockets():
    global tcp_map
    # Execute the command to get TIME_WAIT sockets
    command = "netstat -anolp | grep TIME_WAIT"
    result = subprocess.run(command, capture_output=True, text=True, shell=True)
    if result.returncode == 0:
        lines = result.stdout.strip().split('\n')
        for line in lines:
            parts = line.split()
            if len(parts) >= 7:
                local_address = parts[3]
                remote_address = parts[4]
                # Check if the TIME_WAIT socket was previously tracked
                if (local_address, remote_address) in tcp_map:
                    print(f"TIME_WAIT socket: {local_address} -> {remote_address}, "
                          f"previously tracked PID: {tcp_map[(local_address, remote_address)]}")
                    # Remove the tracked record
                    del tcp_map[(local_address, remote_address)]
```
The third point is puzzling: why do requests issued from inside a Docker container show a source IP of 127.0.0.1 on the host, and appear to be initiated by the Docker process itself?

On Linux, requests originate from the container's IP on the Docker bridge network, so container traffic does not consume the host's dynamic ports. On Windows, if Docker re-initiates the requests on the container's behalf, it would indeed consume the host's port range, as observed. This is very likely the other key reason the problem cannot be reproduced on Linux, besides the fact that Linux talks to usbmuxd over a UNIX socket rather than TCP.
At this point, some interim conclusions can be drawn:

- The pipeline logic is definitely broken and needs fixing; 8 qps of request pressure is not reasonable.
- The failure is most likely caused by dynamic port exhaustion. Linux does not hit it because Docker's container networking works differently there and does not consume host ports, and communication with usbmuxd does not consume TCP ports either.
What remains is to verify these conclusions, looking in the following directions:

- If port exhaustion really is the cause, can the problem be reproduced by constructing the scenario manually?
- Is the pipeline really at fault? The component issuing the requests could also be the problem.
- Does bridge mode on Windows really behave as described when issuing requests? Can that be reproduced? Is there documentation to support it?
To answer the first question, a small interactive client opens and closes an arbitrary number of TCP connections on demand:

```python
import socket
import sys

class ConnectionManager:
    def __init__(self):
        self.connections = []

    def open_connections(self, number):
        """Open a specified number of TCP connections."""
        for _ in range(number):
            try:
                client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                # client_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)  # Allow reuse of the address
                client_socket.connect(('127.0.0.1', 23335))
                self.connections.append(client_socket)
                print(f"Opened connection {len(self.connections)}")
            except Exception as e:
                print(f"Failed to open connection: {e}")

    def close_connections(self, number):
        """Close a specified number of TCP connections."""
        to_close = min(number, len(self.connections))
        for _ in range(to_close):
            client_socket = self.connections.pop()
            client_socket.close()
            print(f"Closed connection, {len(self.connections)} remaining")

def listen_for_input(connection_manager):
    """Listens for user input to open/close connections."""
    while True:
        user_input = input("Enter command (open <number> / close <number>): ")
        parts = user_input.split()
        if len(parts) != 2:
            print("Invalid command. Use 'open <number>' or 'close <number>'.")
            continue
        command, number_str = parts
        try:
            number = int(number_str)
        except ValueError:
            print("Please enter a valid number.")
            continue
        if command == "open":
            connection_manager.open_connections(number)
        elif command == "close":
            connection_manager.close_connections(number)
        else:
            print("Unknown command. Use 'open' or 'close'.")

if __name__ == "__main__":
    connection_manager = ConnectionManager()
    try:
        listen_for_input(connection_manager)
    except KeyboardInterrupt:
        print("\nExiting client.")
        for conn in connection_manager.connections:
            conn.close()
        sys.exit(0)
```
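The client above assumes something is accepting on 127.0.0.1:23335. Any listener that holds connections open works; a stand-in sketch (the port number simply mirrors the client's assumption):

```python
import socket

def make_holding_server(host="127.0.0.1", port=23335, backlog=128):
    """Create a TCP listener on the port the client above connects to."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(backlog)
    return server

def hold_connections(server, held):
    """Accept connections forever and keep each one open, so every client
    socket stays ESTABLISHED (and occupies a dynamic port) until closed."""
    while True:
        conn, _ = server.accept()
        held.append(conn)  # keep a reference so the socket is not closed
```

Run `hold_connections(make_holding_server(), [])`, in a thread or in the foreground, before starting the client.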
Testing shows that after opening roughly 16,000 ports and then closing them all, the problem can indeed be reproduced.

As a cross-check, after waiting a while (roughly two minutes after the first opened port was closed), the problem disappeared on its own and did not recur as long as no new connections were opened.
I'm using Docker on Windows, and I've noticed some weird behavior:

When I use host.docker.internal to access a host port (say, I try to connect to a TCP server listening on 0.0.0.0:23333 on the host machine), I open a socket inside the container with the container ip:port as one end and host.docker.internal:23333 as the other.

However, when I check the open connections on the host, it shows that the just-opened socket has 127.0.0.1 (with another port) as one end and 127.0.0.1:23333 as the other. It seems that Docker acts as an intermediary, accepting the connection to host.docker.internal and reopening one on the host. Is this real and expected? Why is this behavior different from that on Linux?

Also, I've checked ipconfig on Windows and ifconfig on WSL, and I can't see any additional interfaces like `veth` or `docker0` or `br-xxx`, as these interfaces exist on Linux.
------ [Answer]
Yes, your observation about Docker's behavior on Windows is correct and expected due to the differences in how Docker is implemented on Windows compared to Linux.
### Docker Networking on Windows vs. Linux
On Linux, Docker uses the host's kernel features directly, such as namespaces and iptables, to manage containers. It creates virtual network interfaces (like `veth` pairs), bridges (`docker0`), and uses the native Linux networking stack. This setup allows containers to communicate with the host and each other in a manner that closely resembles how separate physical machines would interact on a network.
On Windows, Docker operates differently because Windows does not natively support the same kernel-level features as Linux. Docker on Windows has two main modes: using Hyper-V (with Docker Desktop) or using the Windows Subsystem for Linux (WSL2). Each of these modes handles networking in its own way:
1. **Hyper-V**: When using Docker Desktop with Hyper-V, Docker runs containers inside a lightweight virtual machine. All Docker networking goes through this VM. The `host.docker.internal` DNS name resolves to an internal gateway IP that Docker sets up to route traffic from the container to the host. This is why you see connections appearing to come from `127.0.0.1` on different ports; Docker is essentially NATing (Network Address Translation) the traffic between the container and the host.
2. **WSL2**: In this mode, Docker utilizes a utility VM that runs a real Linux kernel, and Docker containers are managed much like they are on a native Linux system. However, the networking is still handled through a translation layer between Windows and the Linux kernel in WSL2, leading to similar behaviors as seen with Hyper-V.
### Why the Difference?
The primary reason for this difference is the underlying OS architecture and the lack of native container support in Windows. Docker must rely on virtualization and network translation to enable similar functionalities, which inherently changes how networking behaves.
### Lack of `veth`, `docker0`, etc., on Windows
As you've noticed, network interfaces like `veth`, `docker0`, or `br-xxx` are not visible on Windows. This is because these interfaces are part of the Linux networking architecture, which Windows does not natively support. Docker on Windows uses other methods (like creating a virtual network inside Hyper-V or leveraging the network stack in WSL2) to handle container networking.
### Conclusion
The behavior you're seeing is expected given the architectural differences in how Docker operates on Windows versus Linux. Docker must adapt to the underlying operating system's capabilities, leading to different networking setups and behaviors. If you need more consistent behavior across different OSes, you might consider using Linux VMs on Windows or deploying your Docker containers on native Linux systems.
## Summary

The root cause was a pipeline logic change that made the number of requests grow abnormally; the direct trigger was that, with so many more requests, the dynamic port range on Windows was exhausted, producing the error above. Docker's network behavior on Windows further amplified the effect of the request growth (it doubled the number of connections).

The behavioral differences of usbmuxd and Docker between Windows and Linux are why the problem cannot be reproduced on a Linux machine, and the varying task load on a host at any given time is why it cannot be reproduced reliably even on the same host. Manually constructing an extreme scenario does reproduce it reliably.
## Solution

With the cause identified, the fixes are clear:

- The highest priority is to remove the faulty logic inside the pipeline; optimizing away the erroneous calls resolves the immediate problem quickly.
- Since the host's load still has room to grow and the code will never be perfect, the sustainable request concurrency can additionally be raised roughly 4x by enlarging the Windows dynamic port range and by accessing the usbmuxd service directly from inside the container.
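For a rough sanity check of the "4x" figure, one plausible decomposition (my assumption; the arithmetic is not spelled out above) is that widening the dynamic range to start at 32768 doubles the usable ports, and cutting out Docker's duplicated host-side connection halves the ports consumed per request:

```python
# Rough sanity check of the "about 4x" claim (assumed decomposition)
default_range = 65535 - 49152 + 1      # 16384 ports by default

# (1) Widening the dynamic range to start at 32768 (the Linux default)
#     roughly doubles the number of usable ports:
widened_range = 65535 - 32768 + 1      # 32768 ports

# (2) Talking to usbmuxd directly from the container removes Docker's
#     re-initiated host-side connection, halving ports used per request:
ports_per_request_before = 2
ports_per_request_after = 1

gain = (widened_range / default_range) * (ports_per_request_before / ports_per_request_after)
print(gain)  # 4.0
```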