When can two TCP sockets share a local address? (cloudflare.com)
182 points by jgrahamc on March 20, 2023 | 55 comments



I didn't really like the format of this article. The upfront quiz without much context or insight makes it feel like you're in class for an exam you didn't study for. The answers they provide are largely just Yes or No. When you get to the meat of the article, the quiz itself is forgotten until the conclusion. In fact, they even mention how useless the quiz is to provide insight at the beginning of the instructive content.

> Is it all clear now? Well, probably no. It feels like reverse engineering a black box. So what is happening behind the scenes? Let's take a look.

Keeping the quiz format, I think it would've been better to have it at the end then have the answers tie into the concepts you've learned. Or have the questions littered throughout the article and used to build upon the concepts it is teaching you. In its current state, I don't see this article's information having much staying power in my brain without concentrated study of the article.


I enjoyed the quiz personally. I was kinda expecting more complicated questions, eg nat, iptables rules selecting local ip / interface, weird networking edge cases and so on.


This article is meant for tech people and the purpose of the quiz is to make you realize how little you know about the subject even as an expert. This is not a pop-sci YouTube video.


And it fails at that. All the format does is annoy, and having to go through links to random GitHub gists is extra annoying if you didn't know the answer and just want to learn.


Same. I don't want to go run python code to play your little quiz game, sorry.

I think the article would be better structured as an explanation of what a connection is, its 5-tuple, and then a dive into any surprising exceptions to the rule that there can be no ambiguity in where a packet is delivered, it must go to exactly one spot on a host. So as long as you can find an unambiguous subset of available 5-tuples, you're allowed to bind/connect.


The Python quiz code is easily read as identical to the equivalent syscalls. You can click through to the answers and read them out of the docstrings without ever running a Python interpreter (that's what I did).


It was 7am and I wasn't in the mood to hop back and forth between GH and the blog post trying a bunch of examples and piecing things together. I just wanted to read an interesting writeup. That's all.


> hop back and forth between GH and the blog post trying a bunch of examples

The hopping back and forth was intended (and suboptimal) but you were never supposed to try the examples to get the answers.


To each their own. I just didn't like the style. I provided constructive criticism in my initial comment.


I'm not defending the style, I'm saying that part of your criticism is mistaken.

The answers are there so you don't have to run it. Running it is an option if you want to experiment, not a part of taking the quiz.


agreed, the post is missing a clear explanation of what defines a TCP connection


Fair point, though gauging what is the knowledge-level of your audience is hard. Gotta pick what to include and what to omit if you want to keep it concise.


for sure, gauging your audience is difficult. Still, any piece of writing benefits from a good framing. Great post, but I think jumping right into the details loses X% of potential readers. Keep up the good work :)


The identity of a TCP/IP connection is a quintuple containing { src addr, src port, dest addr, dest port, proto }. It feels like the article would be better if before or after all the network stack spelunking and quizzing that it was mentioned somewhere. We don't need to rely on the implementation, though there's always interesting corner cases worth exploring.


What is "proto" in your tuple? If we're talking about TCP, then it's always TCP, isn't it?

I think that formulation was invented to explain the fact that TCP and UDP can have distinct but otherwise identical sockets, but I don't think it's a great way to do that. TCP and UDP just have completely separate spaces of socket addresses, the same way TCP and NetBIOS or TCP and UNIX domain sockets do.


Well, when one says TCP/IP, they usually mean to include UDP and ICMP. Although ICMP doesn't have ports, so managing state is different.

UDP and TCP both use 4-tuples with the same information, so even though I think it's more common to have a separate table for UDP and TCP, you can conceptually consider it a 5-tuple. It's all a conceptual model, but I'd put protocol up front, {tcp, RemoteIP, LocalIP, RemotePort, LocalPort}, {udp, RemoteIP, LocalIP, RemotePort, LocalPort}, {unix, Path}, {netbios, IDontRememberHowItsAddressed}, {icmp, SomethingConfusing}, etc. If you can't handle multiple arity tuples, you could make a nested 2-tuple for tcp and udp, like {tcp, {RemoteIP ... }}. It's all just conceptual notation though, so there's tons of ways to do it (you'll see I differ in both names and ordering compared to the other commenters, but that's not actually significant either)


> {tcp, RemoteIP, LocalIP, RemotePort, LocalPort}

Does the concept of Remote/Local IP have to do/get introduced when you discuss NAT?


For the endpoint there is no difference (because Remote* are NATed and the endpoint sees them pointing at the router, not the original source[0]), but for the router performing the NAT it matters, and usually it's Reply*.

[0] Depends on the NAT type: SNAT always rewrites SourceIP (because otherwise the far system wouldn't know where to reply[1]), while DNAT usually rewrites DestinationIP (because a system wouldn't reply to a received packet addressed to an IP that doesn't exist on that system).

[1] That's why NAT is not a security boundary: it's not trivial, but you can trigger a response from some system behind the NAT by writing a local (to that system) IP in SrcIP.


I would use Remote and Local for host networking first of all; rather than src/dest, because when you send you're the src, and when you receive, you're the dst... you don't want to include both permutations in the table (unless you're both the source and the destination, ie: connecting to yourself).

For NAT, you need to have a way to calculate the 5-tuple for SideA when you have a 5-tuple from SideB, and vice versa; most often, that'll be a table lookup, either for the whole 5-tuple, or for 1:1 NAT, it could just be a lookup for the "Local" IP. In that case, maybe src and dest make more sense, and the NAT isn't really Local in my book.
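
A minimal sketch of that table lookup for source NAT, in Python. Everything here is hypothetical and for illustration only: the `Flow` and `SourceNat` names, the sequential port allocation, and the example addresses. It ignores timeouts, port exhaustion, and everything else a real NAT has to deal with.

```python
from typing import NamedTuple, Optional


class Flow(NamedTuple):
    """A 5-tuple as seen on one side of the NAT (hypothetical layout)."""
    proto: str
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class SourceNat:
    """Toy SNAT: maps inside flows to outside flows and replies back again."""

    def __init__(self, public_ip: str, first_port: int = 1024) -> None:
        self.public_ip = public_ip
        self.next_port = first_port  # naive sequential allocation, no reuse
        self.outbound = {}           # inside flow -> outside flow
        self.inbound = {}            # reply as seen outside -> reply for inside host

    def translate_outbound(self, flow: Flow) -> Flow:
        if flow not in self.outbound:
            outside = flow._replace(src_ip=self.public_ip, src_port=self.next_port)
            self.next_port += 1
            self.outbound[flow] = outside
            # Precompute the reverse mapping for reply packets.
            reply_outside = Flow(flow.proto, flow.dst_ip, flow.dst_port,
                                 self.public_ip, outside.src_port)
            reply_inside = Flow(flow.proto, flow.dst_ip, flow.dst_port,
                                flow.src_ip, flow.src_port)
            self.inbound[reply_outside] = reply_inside
        return self.outbound[flow]

    def translate_inbound(self, flow: Flow) -> Optional[Flow]:
        # None means "no mapping": the toy NAT would drop the packet.
        return self.inbound.get(flow)
```

Because the lookup key is the whole 5-tuple, a packet from an unrelated remote host aimed at the same public IP and port finds no entry. A NAT that keys only on the public port would behave differently.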


Indeed - I fudged things a bit by talking about TCP there. It would have been clearer if I just discussed IP instead.

> TCP and UDP just have completely separate spaces of socket addresses

But so does SCTP, and ICMP and IGMP and ... -- so rather than enumerate the protocols we can just describe this property of IP.


Yeah, but for ICMP and so forth, ports aren't a thing. So you don't really have a universal 5-tuple.

For a router or other middlebox, or an OS kernel, to do things like outbound-initiated-flow firewall-rule exceptions correctly, it must keep N different flow-state tables, one per transport-layer (L4) protocol; where each flow-state table's "primary key" is over a set of columns unique to that table / L4 protocol.

TCP and UDP just happen to be both the best-known L4 protocols, and to both use {srcIP, srcPort, dstIP, dstPort} as their "primary key" for flows; but this doesn't hold for other L4 protocols.
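
As a rough illustration of "one table per L4 protocol, each with its own key columns", here is a Python sketch. The packet dictionaries, the `flow_key` and `FlowTracker` names, and the choice of the echo identifier as the ICMP key column are assumptions made for the example, not a description of any particular kernel.

```python
from collections import defaultdict


def flow_key(pkt: dict) -> tuple:
    """Build the per-protocol 'primary key' for a packet (hypothetical schema)."""
    proto = pkt["proto"]
    if proto in ("tcp", "udp"):
        # Port-based protocols: protocol plus the classic 4-tuple.
        return (proto, pkt["src_ip"], pkt["src_port"], pkt["dst_ip"], pkt["dst_port"])
    if proto == "icmp-echo":
        # No ports here; this sketch assumes the echo identifier names the flow.
        return (proto, pkt["src_ip"], pkt["dst_ip"], pkt["icmp_id"])
    # Anything else needs code written specifically for it.
    raise ValueError(f"no flow schema known for protocol {proto!r}")


class FlowTracker:
    """Keeps a separate flow table (key -> packet count) per protocol."""

    def __init__(self) -> None:
        self.tables = defaultdict(dict)

    def track(self, pkt: dict) -> tuple:
        key = flow_key(pkt)
        table = self.tables[pkt["proto"]]
        table[key] = table.get(key, 0) + 1
        return key
```

The `ValueError` branch is the point of the argument: a protocol the tracker was not built to understand has no key, so it cannot be tracked as a flow at all.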

(Which is in turn why L4 protocols "must" be handled in kernel-land, for kernel firewalls, traffic-shapers, etc. to work: L4 flow-state doesn't have a universal schema for these services to work with; and because these services are implemented in static-compiled languages, they have to be built with compile-time knowledge of each known L4 protocol, so that they can have concrete implementations for each L4 protocol written or generated for each service. There's no way to just bring in (through some hypothetical FUSE-like "userland L4 protocol server" abstraction) more L4 protocols, and expect those kernel facilities to work with them. [And all the same goes for ASICs in L4 network routers — only moreso.] Which is why we got the L4 protocol ossification we did. Modern protocols like SCTP and QUIC being implemented on top of UDP, is a direct result of there being no universal 5-tuple!)


You can handle L4 protocols in userspace. You can bind to a particular IP protocol number. You can even handle L3 in userspace.

Obviously if you do this you lose the ability for multiple applications to handle different "ports", unless you do the multiplexing in userspace as well.


You have a unique definition of "handle" that doesn't seem to include "your OS's kernel packet filter keeps working to pre-filter these packets based on an L4 understanding of them before handing them to userspace, or after being handed them by userspace."

Which, if your machine is acting as something like a router/NAT/firewall, is kind of... the entire point of the box being there in the communication path.


But the structure of those spaces can be different! The only structure IP imposes is that every packet has a source and destination address. It's up to each protocol whether it has port numbers (like TCP, UDP, and SCTP), or not (like ICMP and IGMP), or some other mechanism for identifying flows.


I agree that is super important and maybe worth mentioning but the point of the quiz is to demonstrate that Linux's implementation is actually more constrained than the traditional "unique src addr/port dst addr/port 4-tuple" (for TCP).


Yeah I feel like people are missing the point that it's not a 4-tuple thing, it's an ordering issue. Since the source port (sometimes) gets picked before the interface or destination is, you can get an EADDRNOTAVAIL result even when there's technically a potential for it to end-up as a unique 4-tuple. Doing the assignment in a different order or more explicitly can allow it to work by making sure that the kernel always knows it will be unique.
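
To make the ordering point concrete, here is a toy model in Python. It is not the kernel's algorithm: `ToyStack`, `PortExhausted`, the two-port range, and both allocation rules are simplifications invented for this sketch, and it ignores SO_REUSEADDR and friends entirely.

```python
class PortExhausted(Exception):
    """Raised by the toy model when no local port satisfies its rule."""


class ToyStack:
    def __init__(self, local_ip: str, ports) -> None:
        self.local_ip = local_ip
        self.ports = list(ports)
        self.bound_ports = set()  # ports claimed before a destination was known
        self.connections = set()  # (local_ip, local_port, remote_ip, remote_port)

    def bind_then_connect(self, remote: tuple) -> int:
        """Pick the port first: it must be unused by every existing socket."""
        used = self.bound_ports | {conn[1] for conn in self.connections}
        for port in self.ports:
            if port not in used:
                self.bound_ports.add(port)
                self.connections.add((self.local_ip, port, *remote))
                return port
        raise PortExhausted("every port is already taken by some socket")

    def connect_only(self, remote: tuple) -> int:
        """Pick the port last: only the resulting 4-tuple has to be unique."""
        for port in self.ports:
            if port in self.bound_ports:
                continue  # claimed early by a bind, so skipped here
            candidate = (self.local_ip, port, *remote)
            if candidate not in self.connections:
                self.connections.add(candidate)
                return port
        raise PortExhausted("no port yields a unique 4-tuple for this remote")
```

With two ports and two remotes, `connect_only` fits four connections while `bind_then_connect` gives up after two, even though four distinct 4-tuples exist. On Linux, the IP_BIND_ADDRESS_NO_PORT socket option exists to get the "pick the port last" behavior while still binding to a source IP.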


It's an interesting design decision to preemptively fail when the connect might later be ambiguous. However, in memory allocation, Linux will gladly allocate memory it doesn't have, return success, and then later, when you attempt to use it, kill the process with the OOM killer.


Very late on this one :P But I think the difference is the fact that ports are unique, so if the kernel assigns you a bad source port by accident then it can't change it later. You would need to know this is what happened and bind to a new source port (and do you attempt automatic again?). Whereas with overcommit, your program doesn't know or care what particular memory page you're using, and the kernel can easily fix things up in the background (e.g. use swap, or kill some other process) without you ever knowing about it.


> However in memory allocation, Linux will gladly allocate memory it doesn't have, return success and then later when you attempt to use it, it will kill the process with OOM killer.

In userspace, with overcommit enabled, yes.

In the kernel, you will often do the opposite -- speculatively allocate memory first, then take locks or other serialization primitives, then attempt to insert the object into some container, but fail if there was a collision. (The idea is to keep the relatively slow memory allocation outside of the locked region.)


I'm surprised the blog post does not mention Cloudflare's own library, tubular [0], the "BSD socket API on steroids":

> The control plane for BPF socket lookup. Steers traffic that arrives via the tubes of the Internet to processes running on the machine. Its much more flexible than traditional BSD bind semantics:

> * You can bind to all ports on an IP

> * You can bind to a subnet instead of an IP

> * You can bind to all ports on a subnet

I played with it once and found it to be pretty awesome.

[0] https://github.com/cloudflare/tubular


Thanks for the reminder. This was a blog post after we hit the connectx() troubles :)

Tubular/sk_lookup is about ingress.

This blog post is about connected sockets on “egress”.

Actually, bpf sk lookup could be used on egress, but it's not quite implemented yet.


Was hoping to see a summary, is there one?

Another case is with NAT at the internet boundary. As far as I can tell, Cisco/Palo Alto firewalls only track local_ip:port, so one public IP only gives you 64k connections at most.


Let me try. You could expect that on a Linux host you can have 64K concurrent connections to a single target IP. If you had two target IPs you could expect 128K concurrent connections.

This is not how it works.

Especially when doing the bind-before-connect trick to set the source IP.

Linux internally tracks local ports in a hashtable and often forbids their reuse in surprising ways.

The most surprising part is ordering. If you do bind-before-connect on a port, then a later connect() will not work on that port.

The effect is that it's very hard to achieve the 128K concurrent connections in our two-target scenario.


I've done similar exercises with FreeBSD. The tl;dr is, if you want to approach the connection limits[1], you need to use bind-before-connect for the bulk connections, and it's useful to have connect on a separate (small) port range, for the ancillary connections your machine might have. There's lots of other things you might want to do, such as divide your outgoing ports by thread/CPU so there are no conflicts; and if you're dividing by CPU, you should probably calculate the receive side scaling (RSS) hash, most likely Toeplitz, of the returning packets, so you work the socket from userspace on the same CPU it's going to be worked in kernelspace when it comes in.

Modern FreeBSD and presumably Linux are very good at scaling inbound connections, but if you need to scale outbound connections, you've got to do more of the work.

[1] Also, consider if you're in the client seat, you're more likely to be the one closing the connections, so you've got TIME_WAIT states on your side; depending on your rate of connection closing, you may have a significant number of those, clogging up your ports; or it may not be significant at all.


The NAT box is proxying all the traffic. To the server you're contacting, the NAT box is the address. Which leads to fun things when you try to embed the callback address into the line protocol between the machines. It doesn't work.

So the NAT box gets 64k local ports for the entire organization, not for your one computer. That's still a lot, but if you have one or two popular remotes you talk to, you only get less than 2x64k combinations (say, 64k + 32k). If you have 10 employees that's not a problem. If you have 50k and each service makes 2 connections, you can run out quickly.
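
Back-of-the-envelope version of that arithmetic, as a Python sketch. `nat_capacity` and `PORTS_PER_IP` are names made up for this example, and the result is only an upper bound: it assumes every port number except 0 is usable, which real devices typically don't allow.

```python
PORTS_PER_IP = 65535  # 16-bit port space minus port 0 (an optimistic assumption)


def nat_capacity(public_ips: int, remote_endpoints: int, per_remote_tracking: bool) -> int:
    """Upper bound on concurrent TCP flows through a port-translating NAT.

    per_remote_tracking=False models a box that keys only on its own
    public ip:port; True models one that keys on the full tuple, so the
    same public port can be reused towards a different remote endpoint.
    """
    if per_remote_tracking:
        return public_ips * PORTS_PER_IP * remote_endpoints
    return public_ips * PORTS_PER_IP


# The example from the comment: 50k employees, 2 connections each.
demand = 50_000 * 2
fits_on_one_ip = demand <= nat_capacity(1, 1, per_remote_tracking=False)
```

Under these assumptions the 100,000 flows do not fit behind one public IP talking to one popular remote endpoint; a second public IP, or full-tuple tracking with a second remote endpoint, roughly doubles the ceiling.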


It could be that the NAT box isn't so great, and just limits itself to 64k outgoing connections per IP. Really depends on how hard the developers wanted to work. TCP itself doesn't have that limitation, but implementations of it may.


NAT can be a feature or the whole product.

I would certainly hope there are boxes out there that present multiple IP addresses to the 'outside' network and avoid these problems. It's an old technology and there's been plenty of room for improvement. I haven't caught any consumer products (eg, access points, routers, firewalls) doing this, and a lot of small businesses... well sometimes my work network is better maintained than my home network, and sometimes not so much.


One of the best things about user-space TCP implementations is never needing to be an archaeologist, never trying to figure out the path-dependent sequence of accidents that leads to surprising and undocumented kernel behaviors.


But wouldn't the same issue apply to user-space TCP implementations too? User-space TCP implementations too could have "path-dependent sequence of accidents" which a power user might eventually need to figure out?


Yes but instead of being a 35-year-old accretion of mistakes, a user-space network stack is likely to be part of a more typical software lifecycle, that gets updated more easily and ultimately replaced. Also such things are dramatically easier to debug.


Happily Linux is at most a 32-year old accretion of mistakes (and I'm not completely confident 0.01 even had TCP).


Happily Linux is a good name for a fresh new distro.


What are arguments against user-space TCP?


It doesn't work out-of-the-box with any existing software, so it only applies to things you are building from scratch, or that are structured in such a way that adding an alternate network mode is feasible, or that would work with a dynamic library that shims the entire sockets API. Your user-mode stack won't be observable by any existing monitoring tools. Also, your process needs to execute with CAP_NET_RAW.


One argument is that it requires building on top of a raw socket, which can open you to all sorts of ancient vulnerabilities that have been patched in the battle-tested code running in the kernel, e.g. this recent ICMP remote code execution vulnerability [0] ("An attacker could send a low-level protocol error containing a fragmented IP packet inside another ICMP packet in its header to the target machine. To trigger the vulnerable code path, an application on the target must be bound to a raw socket") [1].

[0] Discussion: https://old.reddit.com/r/netsec/comments/11s80zo/cve20232341...

[1] Advisory: https://msrc.microsoft.com/update-guide/vulnerability/CVE-20...


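    # Placeholders: set these for your machine.
    TAPIF=tap0; BR=br0; LOCAL_IP=192.whatever; MASK=24; REAL_IF=eth0

    # Create a tap device owned by the current user and give it an address.
    sudo ip tuntap add dev $TAPIF mode tap user $(whoami)
    sudo ip addr add $LOCAL_IP/$MASK dev $TAPIF
    sudo ip link set $TAPIF up
    # Bridge the tap device with the real interface.
    sudo ip link add $BR type bridge
    sudo ip link set $TAPIF master $BR
    sudo ip link set $REAL_IF up
    sudo ip link set $REAL_IF master $BR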

Means you don't have to add iptables rules to get the kernel tcp/ip stack to ignore packets meant for your program specific user level stack. Raw sockets require special permissions.

    sudo setcap cap_net_admin,cap_net_raw=eip my_prog_bin

ping needs this, for example.

    getcap $(which ping)
    /bin/ping = cap_net_raw+ep

But this underscores the real issue we face, because enough people won't care about your security if it's convenient for their programming that it isn't a barrier to acceptance.

You have to get the kernel to do things, probably using root privs, to get out of the way of your program's IP traffic now. The kernel will jump in and reject a SYN or SYN-ACK response meant for your program and its user-level stack. You don't have to do anything like that if your program calls socket() to get an fd and goes on in the usual manner from there.


If you are able, could you explain this in more detail? I find the description unparseable.

Reading the words I do understand, the raw socket aspect seems to be irrelevant, right? The vulnerable code would be the parser that incorrectly runs code based on invalid input? It might require a raw socket to trigger the vulnerability, but perhaps it would not exist if it was not written in kernel C but rather in user space garbage collected code with a good type system (not sure what the vulnerability actually is).


Normally, if an app opens a port it is not allowed to (by firewall or application permissions), it will just get an error; with RAW sockets the kernel would need to parse the packet before deciding that.

For example, you will normally get permission denied when you try to listen on a sub-1024 port as a normal user.

I'd also imagine that if the kernel is doing any kind of connection tracking (so really anything with a firewall), it would be more optimal to have that connection tracked in the kernel vs. decoding it and adding it to the conntrack table.

I guess some kind of half-RAW could be done in place, like, say, a socket where you define the protocol and port but handle the actual packets in userspace?


Nice drgn shout-out 2/3 of the way down the page!

https://drgn.readthedocs.io/en/latest/index.html


meaningless seo quantum keyword


The content is good, the way it is written is terrible


Any pointers? What would you change?


Can someone TLDR this? interesting question but not worth reading all this to find out.

Can two IPs bind to the same IP / port on the server? Even if they come from the same client, my intuition says no.


I'm having a hard time understanding your question. But let me try. If none of these answer your question, you're going to need to provide some more concrete details.

If your server has IPs 192.0.2.1 and 192.0.2.2, you can bind to 192.0.2.1:80 on one socket and 192.0.2.2:80 on another and listen for connections on both (or use it for outgoing connections, whatever). Or you could bind to 0.0.0.0:80 and listen for connections on either, as long as they don't have more specific bindings.

If your client has IP 198.51.100.1, it can connect from tcp 198.51.100.1:18490 to 192.0.2.1:80 in one socket, and from tcp 198.51.100.1:18490 to 192.0.2.2:80 in another. You could also do the same on UDP, UDP's ports are orthogonal to TCP's.

If you have another client with IP 198.51.100.2, it can also use local port 18490 to connect to both servers on port 80.

A server can accept a virtually unlimited number of connections on a given tcp port, but a maximum of 64k per client IP, and your operating system will have some limit on the number of sockets; you can often raise that to some function of physical memory (FreeBSD limits you to a maximum of one fd for each four pages of physical memory); although, it takes a very specific load to be able to max out the FDs. Something like HAProxy in tcp forwarding mode can do it, but most other applications are going to use enough CPU and memory that you'll run out of those before you run out of FDs.


i was thinking of "scenario #1" in the article.

you explained it well and i was wrong. i dug in a bit more and it looks like the linux kernel represents a connection as a 4-tuple: source ip, source port, destination ip, destination port

which is what allows the kernel to accept multiple connections on the listening port.
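
A small loopback sketch of this in Python, using only the standard `socket` module. The function name `demo_shared_listen_address` is made up for the example, and it assumes 127.0.0.1 is reachable on the machine running it.

```python
import socket


def demo_shared_listen_address():
    """Accept two connections on one listening port and report their addresses."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    clients, accepted = [], []
    try:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(8)
        local = listener.getsockname()
        for _ in range(2):
            clients.append(socket.create_connection(local))
            accepted.append(listener.accept())  # (connected socket, peer address)
        # Each accepted socket reports the listener's address as its local one.
        return local, [(conn.getsockname(), peer) for conn, peer in accepted]
    finally:
        for client in clients:
            client.close()
        for conn, _ in accepted:
            conn.close()
        listener.close()


if __name__ == "__main__":
    local, conns = demo_shared_listen_address()
    print("listener:", local)
    for sockname, peer in conns:
        print("accepted socket local:", sockname, "peer:", peer)
```

Both accepted sockets share the listener's local IP and port; the only thing that tells them apart is the remote half of the tuple, here the two different client ports.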



