Signs of Triviality

Opinions, mostly my own, on the importance of being and other things.
[homepage]  [blog]  [jschauma@netmeister.org]  [@jschauma]  [RSS]

IPv4 addresses are silly, inet_aton(3) doubly so.

October 28th, 2021

128 bit IPv6 addresses are cute and all, but how about... IPv?

$ curl -s -v -I http://3010966065296825858750772020886900491758623155804418\
276008087971185053087193329865127509253199563541586634156262274362119706864\
436314715016226499517535126475570205383122687361892587226408381694868597191\
4830816722015764794244138634937665528586884556100653009798956899
*   Trying 166.84.7.99:80...
* Connected to 301096606529682585875077202088690049175862315580441827600808\
797118505308719332986512750925319956354158663415626227436211970686443631471\
501622649951753512647557020538312268736189258722640838169486859719148308167\
22015764794244138634937665528586884556100653009798956899 (166.84.7.99) port 80 (#0)
> HEAD / HTTP/1.1
> Host: 3010966065296825858750772020886900491758623155804418276008087971185\
053087193329865127509253199563541586634156262274362119706864436314715016226\
499517535126475570205383122687361892587226408381694868597191483081672201576\
4794244138634937665528586884556100653009798956899
>

Wait... what? What's going on here? 301096...56899 is not a Fully-Qualified Domain Name (it's missing a TLD, for starters), but at 266 characters, it can't be a hostname label that, say, /etc/resolv.conf completes, either. It's not an IPv6 address, and it sure as heck doesn't look like an IPv4 address in dotted-decimal notation.

Yet somehow curl(1) translates this into 166.84.7.99, the IPv4 address of this very web server. And things get stranger still! For example, I just found out that apparently Amazon has registered π for themselves:

$ pi=$(echo "scale=8; 4*a(1)" | bc -l)
$ echo $pi
3.14159264
$ ping -c 1 $pi
PING ec2-3-216-13-160.compute-1.amazonaws.com (3.216.13.160): 56 data bytes

And we all know that Google DNS uses 8.8.8.8, while Cloudflare has 1.1.1.1, but now Google looks like it's trying to steal users from Cloudflare by offering their DNS via 010.010.010.010 as well:

$ curl https://010.010.010.010
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>302 Moved</TITLE></HEAD><BODY>
<H1>302 Moved</H1>
The document has moved
<A HREF="https://dns.google/">here</A>.
</BODY></HTML>
$

(Looks like somebody picked up this particular example on HackerNews.)

Alright, alright, I think you probably already know where this is going. All of the above are simply examples of a peculiar way of specifying IPv4 addresses that, for historical reasons, a specific library function accepts: inet_aton(3):

Values specified using the "dotted quad" notation take one of the follow-
ing forms:

      a.b.c.d
      a.b.c
      a.b
      a

This weird convention dates back to the days of classful networking, where the leading bits of an IPv4 address indicated the length of the network- and host portions of the address. And while we're used to specifying IPv4 addresses in full dotted-decimal notation (i.e., a.b.c.d), it's useful to remember that at the end of the day, an IPv4 address is just a 32-bit number, and we can express numbers in a variety of ways. inet_aton(3) accepts these different notations, so let's see what each of those means specifically.

Dotted-Decimal Addresses

When four parts are specified, each is interpreted as a byte of data and assigned, from left to right, to the four bytes of an Internet address.

Ok, not much to see here, that's what we expect -- something like 166.84.7.99 -- although we'll get back to some surprises with this format a bit further down. For now, let's move on.

Single part or number

When only one part is given, the value is stored directly in the network address without any byte rearrangement.

All numbers supplied as "parts" in a "dotted quad" notation may be decimal, octal, or hexadecimal, as specified in the C language (i.e., a leading 0x or 0X implies hexadecimal; otherwise, a leading 0 implies octal; otherwise, the number is interpreted as decimal).

Let's illustrate by example of our IPv4 address here on this web server: 166.84.7.99. First we convert this number to binary:

     166 .       84 .        7 .       99
10100110 . 01010100 . 00000111 . 01100011

Next, we glue the binary octets together and convert that number to decimal. (Because I'm lazy, I have a few shell functions to perform these manipulations, which you can find here.)

$ ipv4ToBinary 166.84.7.99
10100110010101000000011101100011
$ binaryToDec 10100110010101000000011101100011
2790524771

# just to confirm
$ decToBinary 2790524771 | binaryToIPv4
166.84.7.99
$ 
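
For those who'd rather not shell out, the same gluing can be sketched in C. This is only an illustration: the function names octets_to_u32 and u32_to_dotted are made up here and are not part of any library or of my shell functions.

```c
#include <stdint.h>
#include <stdio.h>

/* Glue four octets (most significant first) into one 32-bit number. */
uint32_t octets_to_u32(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
	return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
	       ((uint32_t)c << 8)  |  (uint32_t)d;
}

/* Reverse direction: write the dotted-decimal form of n into buf. */
int u32_to_dotted(uint32_t n, char *buf, size_t len)
{
	return snprintf(buf, len, "%u.%u.%u.%u",
	    (unsigned)(n >> 24), (unsigned)((n >> 16) & 0xff),
	    (unsigned)((n >> 8) & 0xff), (unsigned)(n & 0xff));
}
```

With these, octets_to_u32(166, 84, 7, 99) yields 2790524771, matching the shell output above.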

In other words, our dotted-decimal IPv4 address 166.84.7.99 is the same as 2790524771, and per the manual page above, inet_aton(3) should accept that, so let's give it a try:

$ ping -c 1 2790524771
PING 2790524771 (166.84.7.99): 56 data bytes
64 bytes from 166.84.7.99: icmp_seq=0 ttl=248 time=16.103 ms

--- 2790524771 ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 16.103/16.103/16.103/0.000 ms
$ 

Yep, looks like that works. And as promised, we can also use octal or hexadecimal representations of this number. Since we've established that it's inet_aton(3) that's doing the magic here, we don't need ping(1) or curl(1) any longer and can instead just ask inet_aton(3) directly:

$ cat ip.c
#include <arpa/inet.h>

#include <ctype.h>
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
	if (argc != 2) {
		fprintf(stderr, "Usage: %s arg\n", argv[0]);
		exit(EXIT_FAILURE);
	}

	struct sockaddr_in sa;

	(void)memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_len = sizeof(struct sockaddr_in);	/* BSD-only member; drop this line on Linux */

	if (inet_aton(argv[1], &sa.sin_addr) != 0) {
		(void)printf("%s: %s\n", argv[1], inet_ntoa(sa.sin_addr));
	}
}
$ cc -Wall -Werror -Wextra ip.c -o aton
$ ./aton 0$( ( echo obase=8; echo 2790524771; ) | bc)    # leading 0 => octal
024625003543: 166.84.7.99
$ ./aton 0x$( ( echo obase=16; echo 2790524771; ) | bc)  # leading 0x => hex
0xA6540763: 166.84.7.99
$ 

In other words: 166.84.7.99 (dotted-decimal) == 2790524771 (decimal) == 024625003543 (octal) == 0xA6540763 (hexadecimal). They're all just different representations of the same 32 bit number.
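
The C-style prefix rule is the same one strtoul(3) applies when asked for base 0, so the equivalence can be sketched with it. This is a minimal sketch under that assumption, not how any particular inet_aton(3) is implemented; parse_part is a name I made up, and error handling is left out on purpose.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse one numeric "part" the way C source literals are read:
 * base 0 tells strtoul(3) to honor a leading 0x/0X (hex) or a
 * leading 0 (octal), and to fall back to decimal otherwise.
 * No error handling in this sketch.
 */
uint32_t parse_part(const char *s)
{
	return (uint32_t)strtoul(s, NULL, 0);
}
```

All three spellings from above come out as the same 32-bit value.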

So far, so good. But inet_aton(3) also accepts other formats, so let's explore those:

Two and three part addresses

When a two part address is supplied, the last part is interpreted as a 24-bit quantity and placed in the right most three bytes of the network address. This makes the two part address format convenient for specifying Class A network addresses as "net.host".

That is, if we specify a.b, then inet_aton(3) interprets a as the leading 8 bits of a classful network component of a Class A network, and b as the classful host component consisting of the remaining 24 bits.

So if we take our IPv4 address 166.84.7.99, we'd use a /8 CIDR notation to emulate the (theoretical) classful 166.0.0.0 network, on which we can then address hosts as 166.<host component>, where 166.1 == 166.0.0.1, 166.255 == 166.0.0.255, 166.256 == 166.0.1.0, and so on. To identify our full address, we again convert (the dotted host component) to binary and then to decimal:

$ ./aton 166.0
166.0: 166.0.0.0
$ ./aton 166.1
166.1: 166.0.0.1
$ ./aton 166.2
166.2: 166.0.0.2
$ ./aton 166.255
166.255: 166.0.0.255
$ ./aton 166.256
166.256: 166.0.1.0
$ ./aton 166.512
166.512: 166.0.2.0
$ ./aton 166.65536
166.65536: 166.1.0.0
$ ./aton 166.$(ipv4ToBinary 0.84.7.99 | binaryToDec)
166.5506915: 166.84.7.99

Similarly for three part addresses:

When a three part address is specified, the last part is interpreted as a 16-bit quantity and placed in the right-most two bytes of the network address. This makes the three part address format convenient for specifying Class B network addresses as "128.net.host".

$ ./aton 166.84.0
166.84.0: 166.84.0.0
$ ./aton 166.84.1
166.84.1: 166.84.0.1
$ ./aton 166.84.2
166.84.2: 166.84.0.2
$ ./aton 166.84.65535
166.84.65535: 166.84.255.255
$ ipv4ToBinary 0.0.7.99 | binaryToDec
1891
$ ./aton 166.84.1891
166.84.1891: 166.84.7.99
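
The rule behind both formats is the same: every part but the last takes one byte, and the last part fills whatever is left. Here's a rough sketch of that arithmetic in C, assuming the parts have already been converted to numbers. compose_parts is a hypothetical helper for illustration, not a libc function.

```c
#include <stdint.h>

/*
 * Combine 1 to 4 already-parsed parts into a host-order 32-bit address.
 * All parts but the last must fit in one byte; the last part fills the
 * remaining 32, 24, 16, or 8 bits. Returns 1 on success, 0 on failure.
 */
int compose_parts(const uint32_t *parts, int n, uint32_t *out)
{
	uint32_t addr = 0;
	int i;

	if (n < 1 || n > 4)
		return 0;
	for (i = 0; i < n - 1; i++) {
		if (parts[i] > 0xff)
			return 0;
		addr |= parts[i] << (24 - 8 * i);
	}
	/* Bits left over for the final part: 32 - 8 * (n - 1). */
	if (n > 1 && parts[n - 1] > (0xffffffffu >> (8 * (n - 1))))
		return 0;
	*out = addr | parts[n - 1];
	return 1;
}
```

Feeding it {166, 5506915} or {166, 84, 1891} gives 2790524771 in both cases, while {166, 84, 65536} is rejected because the last part no longer fits in 16 bits.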

Mixing Formats

Finally, as promised by the manual page, all parts can be given in either decimal, octal, or hexadecimal formats. So the following are all representations of the same address:

$ ./aton 166.84.7.99              # decimal dotted
166.84.7.99: 166.84.7.99
$ ./aton 2790524771               # decimal
2790524771: 166.84.7.99
$ ./aton 024625003543             # octal 
024625003543: 166.84.7.99
$ ./aton 166.0x540763             # decimal.hex
166.0x540763: 166.84.7.99
$ ./aton 0246.84.07.0x63          # octal.decimal.octal.hex
0246.84.07.0x63: 166.84.7.99
$ 

Additional Examples

Ok, so now let's take another look at the examples I gave above. I'm sure you can already tell what's going on there.

In the example of Amazon owning π, we used 3.14159264, which we now see is merely a number in a.b format. We can find a few similar examples:

$ ./aton 3.14159264                      # pi
3.14159264: 3.216.13.160
$ decToBinary 14159264 | binaryToIPv4
216.13.160
$ ./aton 2.7182818                       # e
2.7182818: 2.109.153.226
$ ./aton 1.4142135                       # sqrt(2)
1.4142135: 1.63.52.55
$ 

Amazon was assigned 3.128.0.0/9 by ARIN, Euler's number falls into 2.104.0.0/13, assigned to Tele Danmark, and the square root of 2 "belongs" to China Unicom via 1.56.0.0/13.

And Google... isn't trying to impersonate Cloudflare. 010.010.010.010 is 8.8.8.8 in octal octets, while 1.1.1.1 would be 01.01.01.01 (octal), 0x01010101 (hex), or, say, 1.1.257 (decimal).

But what about...

IPv

$ curl -s -v -I http://3010966065296825858750772020886900491758623155804418\
276008087971185053087193329865127509253199563541586634156262274362119706864\
436314715016226499517535126475570205383122687361892587226408381694868597191\
4830816722015764794244138634937665528586884556100653009798956899

Clearly 301096...56899 is not in any dotted format, so should be treated as a single number. It's neither hexadecimal (doesn't start with 0x), nor octal (doesn't start with 0), so would be interpreted as a regular decimal number. But how does inet_aton(3) turn that into 166.84.7.99?

And what's more interesting: how do we even manage to use such a large number? The largest integer type in C is uintmax_t, which is 8 bytes on this system, so the largest number it can hold is UINTMAX_MAX:

$ cat /tmp/a.c
#include <inttypes.h>
#include <stdio.h>

int main() {
        printf("%zu\n", sizeof(uintmax_t));     /* %zu: size_t */
        printf("%ju\n", UINTMAX_MAX);           /* %ju: uintmax_t */
}
$ cc -Wall -Werror -Wextra /tmp/a.c
$ ./a.out
8
18446744073709551615
$ 

But... wait a second. If the largest integer type can only hold 8 bytes * 8 bits = 64 bits, with a max value of 2^64 - 1, then how do we store 128-bit IPv6 addresses? Let's look at netinet/in.h and netinet6/in6.h real quick, where we find the struct in_addr and struct in6_addr defined respectively:

/* From <sys/ansi.h>: */
typedef __uint32_t      __in_addr_t;    /* IP(v4) address */

/*
 * Internet address (a structure for historical reasons)
 */
struct in_addr {
        in_addr_t s_addr;
} __packed;


struct in6_addr {
        union {
                __uint8_t   __u6_addr8[16];
                __uint16_t  __u6_addr16[8];
                uint32_t  __u6_addr32[4];
        } __u6_addr;                    /* 128-bit IP6 address */
}; 

Alright, so an IPv4 address is quite simply a 32-bit unsigned int, but an IPv6 address is stored as either 16 8-bit numbers, 8 16-bit words, or 4 32-bit numbers. The use of a union here allows you to store the same data in different formats, thereby making it easier for you to pull the parts you want out of the memory location:

$ dig +short aaaa panix.netmeister.org
2001:470:30:84:e276:63ff:fe72:3900

# As binary:
$ ipv6ToBinary 2001:470:30:84:e276:63ff:fe72:3900
0010000000000001000001000111000000000000001100000000000010000100\
1110001001110110011000111111111111111110011100100011100100000000

# As 8 16-bit words:
0010000000000001 0000010001110000 0000000000110000 0000000010000100\
1110001001110110 0110001111111111 1111111001110010 0011100100000000
 converted to hex =>
            2001              470               30               84\
            e276             63ff             fe72             3900
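
To see the union trick in action without depending on the system's private member names, here's a small sketch using a stand-in type. union addr128 and word16 are hypothetical names for illustration only; the real struct in6_addr is laid out as shown above. Note that the 16-bit view is in network byte order, hence the ntohs(3).

```c
#include <arpa/inet.h>	/* ntohs(3) */
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct in6_addr, for illustration only. */
union addr128 {
	uint8_t  b[16];
	uint16_t w[8];
	uint32_t l[4];
};

/*
 * Return the i-th 16-bit word (0..7) of a 128-bit address given as 16
 * bytes in network order. The bytes are stored through one union member
 * and read through another; ntohs(3) undoes the network byte order.
 */
uint16_t word16(const uint8_t bytes[16], int i)
{
	union addr128 a;

	memcpy(a.b, bytes, sizeof(a.b));
	return ntohs(a.w[i]);
}
```

For the address above, word 0 comes back as 0x2001 and word 7 as 0x3900, the same groups you see in the textual form.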

The last option in the in6_addr union of using 4 32-bit numbers can be useful for IPv4-mapped IPv6 addresses, where the 32-bit IPv4 address is written in dotted-decimal after the IPv6 prefix ::ffff, for example:

::ffff:166.84.7.99 in 4 32-bit binary numbers:
00000000000000000000000000000000 00000000000000000000000000000000\
00000000000000001111111111111111 10100110010101000000011101100011
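
Building such a mapped address by hand is just a matter of placing bytes. A minimal sketch, assuming a host-order 32-bit IPv4 address as input; make_v4_mapped is a made-up helper name, not a system function.

```c
#include <stdint.h>
#include <string.h>

/*
 * Fill out[16] with the IPv4-mapped IPv6 address ::ffff:a.b.c.d for a
 * host-order IPv4 address: 80 zero bits, 16 one bits, then the 32 bits
 * of the IPv4 address, most significant byte first.
 */
void make_v4_mapped(uint32_t v4, uint8_t out[16])
{
	memset(out, 0, 16);
	out[10] = 0xff;
	out[11] = 0xff;
	out[12] = (uint8_t)(v4 >> 24);
	out[13] = (uint8_t)(v4 >> 16);
	out[14] = (uint8_t)(v4 >> 8);
	out[15] = (uint8_t)v4;
}
```

Called with 2790524771, the last four bytes end up as 166, 84, 7, 99, which is the final 32-bit group in the binary dump above.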

Ok, ok, nice rabbit hole, but where were we? Oh, right, we determined that an IPv4 address, stored as an unsigned 32-bit int cannot hold the gigantic number we used at the beginning of this blog post. What is the largest number it can hold, and what would that number be when handed to inet_aton(3)?

$ ./aton $(echo 2^32 - 1 | bc)
4294967295: 255.255.255.255

Ok, that makes sense: 2^32 - 1 becomes 4294967295 decimal, which, in dotted-decimal representation is the maximum IPv4 address 255.255.255.255. Now what happens when we increase that number?

$ ./aton 4294967296
4294967296: 0.0.0.0
$ ./aton 4294967297
4294967297: 0.0.0.1
$ ./aton 4294967298
4294967298: 0.0.0.2

That's right: we're wrapping around, since we're overflowing the data type. Note that because we are dealing with an unsigned int, this does not constitute "undefined behavior", as it would be if we used a signed data type!

ISO/IEC 9899:201x says:

A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type.

So when inet_aton(3) parses the string "301096...56899", it will try to stuff the numerical value into a uint32_t as seen here:

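uint32_t val;
int base, digit;
char c = *cp;           /* cp: the string being parsed */

val = 0; base = 10; digit = 0;
for (;;) {
        if (isascii(c) && isdigit((unsigned char)c)) {
                /* uint32_t arithmetic: silently wraps modulo 2^32 */
                val = (val * base) + (c - '0');
                c = *++cp;
        } else {
                break;  /* abridged: hex digit and '.' handling omitted */
        }
}

addr->s_addr = htonl(val);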

(Note: on Linux, since glibc-2.0.95 inet_aton(3) is implemented using strtoul(3), which fails with ERANGE, so doesn't modulo wrap the number and simply fails to accept it.)
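
As an aside, the ERANGE behavior mentioned in that note is easy to observe with strtoul(3) alone. This is only a sketch of the overflow check, not glibc's actual code; fits_unsigned_long is a name I made up. Where unsigned long is wider than 32 bits, a real implementation would additionally have to reject values above 4294967295.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Return 1 if the decimal/octal/hex string s fits in an unsigned long,
 * 0 if strtoul(3) reports overflow via errno == ERANGE.
 */
int fits_unsigned_long(const char *s)
{
	errno = 0;
	(void)strtoul(s, NULL, 0);
	return errno != ERANGE;
}
```

A 10-digit number like 2790524771 passes, while the first 40 digits of our monster number already trip the range check.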

When val becomes larger than UINT32_MAX, it can no longer be represented in val, so is reduced modulo 2^32 (that is, UINT32_MAX + 1):

(gdb) c
Continuing.

Watchpoint 2: val

Old value = 0
New value = 3
inet_aton (
    cp=0x7f7fff91cb0b "301096606529..."..., addr=0x7f7fff91c4d4) at ip.c:44
44                                      c = *++cp;
[...]

Old value = 301096606
New value = 3010966065
inet_aton (
    cp=0x7f7fff91cb14 "296825858750..."..., addr=0x7f7fff91c4d4) at ip.c:44
44                                      c = *++cp;

Old value = 3010966065
New value = 44889580
inet_aton (
    cp=0x7f7fff91cb15 "968258587507..."..., addr=0x7f7fff91c4d4) at ip.c:44
44                                      c = *++cp;

Old value = 44889580
New value = 448895809
inet_aton (
    cp=0x7f7fff91cb16 "682585875077..."..., addr=0x7f7fff91c4d4) at ip.c:44
44                                      c = *++cp;

After we correctly converted the first leading digits 3010966065, the next digit would have brought our total value to 30109660652, which is larger than UINT32_MAX = 4294967295, so the value is reduced modulo 2^32: 30109660652 % 4294967296 = 44889580. We then merrily continue on our way, adding the next digit's value, and the next, and so on.
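
The wrapping is easy to reproduce outside of libc. Here's a minimal sketch of just the digit loop, decimal only, with no dots or hex handling; wrap_parse_decimal is a hypothetical name, and this is not the library's actual source.

```c
#include <ctype.h>
#include <stdint.h>

/*
 * Accumulate a decimal string into a uint32_t, one digit at a time.
 * Unsigned arithmetic is reduced modulo 2^32, so oversized inputs wrap
 * instead of failing. Parsing stops at the first non-digit.
 */
uint32_t wrap_parse_decimal(const char *cp)
{
	uint32_t val = 0;

	while (isdigit((unsigned char)*cp)) {
		val = val * 10u + (uint32_t)(*cp - '0');
		cp++;
	}
	return val;
}
```

Fed the first eleven digits "30109660652", it returns 44889580, and with one more digit, "301096606529", it returns 448895809, just like the watchpoint output above.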

We keep doing this over and over, wrapping the number many times, until all that is left is the remainder modulo 2^32. For any numbers A and X, (A * X) % X = 0, so we can generate a really long number that maps to the desired IP address by multiplying any number by 2^32 (that is, UINT32_MAX + 1) and then adding the decimal value of the IP address in question:

$ MODULUS=4294967296      # 2^32, i.e., UINT32_MAX + 1
$ echo "$(tr -dc '[:digit:]' </dev/urandom | head -c 256) * ${MODULUS} + \
        $(ipv4ToDec 166.84.7.99)" | bc
30109660652968258587507720208869004917586231558044182760080879711850\
53087193329865127509253199563541586634156262274362119706864436314715\
01622649951753512647557020538312268736189258722640838169486859719148\
30816722015764794244138634937665528586884556100653009798956899

This long number then will be parsed and modulo wrapped to 2790524771, which inet_aton(3) then stores in the uint32_t, representing 166.84.7.99, which curl(1) then connects to.
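
The construction works for any multiplier, not just a 256-digit one. Here's a scaled-down sketch that stays within 64 bits; disguise is a made-up name, and the multiplier k is limited to 32 bits only so the result fits in a uint64_t, whereas bc(1) has no such limit.

```c
#include <stdint.h>

/*
 * Build a larger number that wraps back to addr: k * 2^32 + addr.
 * Reducing the result modulo 2^32 (e.g., by casting to uint32_t)
 * always yields addr again, whatever k is.
 */
uint64_t disguise(uint32_t addr, uint32_t k)
{
	return ((uint64_t)k << 32) + addr;
}
```

For example, disguise(2790524771, 7) is 32855295843, and truncating that to 32 bits gives 2790524771 again.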

However, if you were to run the command from above, you'd notice that my web server responds with a 400 Bad Request. At the same time, the following succeeds:

$ curl -v -I http://2790524771
*   Trying 166.84.7.99:80...
* Connected to 166.84.7.99 (166.84.7.99) port 80 (#0)
> HEAD / HTTP/1.1
> Host: 166.84.7.99
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 302 Found

vs

$ curl -s -v -I http://3010966065296825858750772020886900491758623155804418\
276008087971185053087193329865127509253199563541586634156262274362119706864\
436314715016226499517535126475570205383122687361892587226408381694868597191\
4830816722015764794244138634937665528586884556100653009798956899
*   Trying 166.84.7.99:80...
* Connected to 301096606529682585875077202088690049175862315580441827600808\
797118505308719332986512750925319956354158663415626227436211970686443631471\
501622649951753512647557020538312268736189258722640838169486859719148308167\
22015764794244138634937665528586884556100653009798956899 (166.84.7.99) port 80 (#0)
> HEAD / HTTP/1.1
> Host: 3010966065296825858750772020886900491758623155804418276008087971185\
053087193329865127509253199563541586634156262274362119706864436314715016226\
499517535126475570205383122687361892587226408381694868597191483081672201576\
4794244138634937665528586884556100653009798956899
>
< HTTP/1.1 400 Bad Request

The reason for this is that in the first case, curl(1) sets the Host: header to the IPv4 address (because it was able to parse that as an address itself?), but in the second case to the very long numerical string. Apache httpd does not seem to like such a string as a Host: header, referencing RFC3986:

Although the URI syntax for IPv4address only allows the common dotted-decimal form of IPv4 address literal, many implementations that process URIs make use of platform-dependent system routines, such as gethostbyname() and inet_aton(), to translate the string literal to an actual IP address. Unfortunately, such system routines often allow and process a much larger set of formats than those described in Section 3.2.2.

These additional IP address formats are not allowed in the URI syntax due to differences between platform implementations.

(If you'd like a proper 200 response, you can try talking to port 8080, where bozohttpd does not give a damn.)

Summary

Now all of this is largely esoteric and not something that you can or should really rely on. inet_aton(3) has inherited this behavior from the early days due to classful networking, which we haven't used in decades, and all of this only applies to IPv4, not IPv6. inet_pton(3), which you should be using anyway, does not encourage such shenanigans. However, because getaddrinfo(3) explicitly uses inet_aton(3) itself for AF_INET, citing RFC3493, you will encounter this behavior in various applications, and it may be used -- besides as a party trick for very peculiar parties -- as an obfuscation technique by e.g., malware, and thus is something that it's good to be aware of.
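
If you want to check whether a string is a plain dotted-decimal address, inet_pton(3) can serve as the strict counterpart. A minimal sketch; strict_ipv4 is a hypothetical wrapper name, and the exact handling of edge cases such as leading zeros may differ between platforms.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Return 1 if s is accepted by inet_pton(3) as an IPv4 address,
 * 0 otherwise.
 */
int strict_ipv4(const char *s)
{
	struct in_addr addr;

	return inet_pton(AF_INET, s, &addr) == 1;
}
```

The four-part form 166.84.7.99 is accepted, while 2790524771, 166.5506915, and 0xA6540763 are not.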

You know, the more you know...
