From eBower Wiki

Overview

Knowing the IP address of a client connecting to your server is important for a variety of reasons, but you also need to be careful about how you use this information because some uses are pretty dangerous in an increasingly mobile world. In most cases for small infrastructures you know the client IP address because you see it directly, but what happens when something is in the way and obscuring the client IP address? In many cases you're at the mercy of whoever owns that something; this article describes some of your options if you own (or pay for) that something.

I'll use the term "Client IP" to describe the public IP address of a particular client. This isn't their actual IP address, which is likely a private IP handed out by their router, but it's the closest we can get to their entry point into the public Internet. Conversely, I'll use the terms "Source IP" or "Apparent IP" to describe the IP address you see at your server. This is the source IP from the actual TCP connection that gets established, or in PHP terms $_SERVER['REMOTE_ADDR'];

Reliability of the Client IP

The first thing you'll want to do is consider *why* you want to know the client's actual IP address. It's not the magic unique and permanent identifier on the Internet that people think it is. Always treat an IP address as valid for the duration of a TCP connection, not the duration of a login session. And never, ever assume it will be the same user next week as it is this week.

  • First of all, we've got proxies. Corporate proxies aren't much of a concern here, that's not much different from a corporate NAT where a lot of people use the same IP address. Public proxies are a concern. For mostly-legitimate use cases (using a VPN to encrypt the first-mile link on public WiFi networks, getting around restrictive corporate or governmental firewalls) we care about offering support to these users. However, in most cases this means that we also don't know their IP address. Luckily in most cases it also means the IP address we think they have is pretty stationary. Except those who use Tor. Tor will randomize the exit node for the user so the apparent source IP address could move around. You may feel that proxies like this aren't worth supporting, the percentage of legitimate users accessing your site through them is low and the percentage of users who use them to attack your site is high. It's hard to argue with those numbers.
  • It used to be that when a cablemodem or DSL modem rebooted it always got the same IPv4 address; those days are behind us with the current IPv4 crunch. It's still a long shot that a lease is up or a device reboots and the session is maintained so that may be an acceptable loss. But what about mobile users?
  • An increasing number of users are using mobile devices to connect to your site; these devices change IP addresses frequently, move between WiFi and carrier networks, and even flip between IPv6 connectivity over LTE and IPv4 connectivity over 3G. Assuming the login IP never changes alienates these users and this is something nobody can afford to do as we cross the point where smart mobile devices outnumber PCs.
  • IPv6 brings a special consideration, at the time of this writing it's only 2% of your users but by YE2014 it's projected to be closer to 5% and more than double annually. If you're not IPv6 now, you will be soon or you'll be stuck with users behind SuperNATs where you can't see the client's IP address and you also can't control whether they maintain the same public IP over time. IPv6 has the benefit of identifying an individual user behind what would be a NAT today. However, to counter privacy concerns around that, IPv6 stacks often collect IPv6 addresses so periodically connections can be made with a new address from the same subnet. My laptop right now has 7 IPv6 addresses. New connections are made with the most recent address, but the old ones persist in case I have any old connections. Rotating IPv6 addresses happens very frequently.
  • Finally, we've got SuperNATs, a plague about to beset us all. We're all used to home NAT devices, for the better part of two decades we've had to share an IP address with everyone in the household. These NATs aren't bad, you've usually only got a single user accessing any one site behind the NAT. Corporate NATs are a bit worse, especially for B2B applications, but still it's controllable because end users have access to change the NAT behavior as needed. IPv4 exhaustion is causing carriers like BT to start deploying carrier grade NATs. Here you've got thousands or tens of thousands of users sharing a small block of public IP addresses and you're relying on a carrier with millions of customers to listen to your userbase to change a major component of their network if things misbehave. Relying on a client's source IP address to remain static means blind faith that the carrier knows what they're doing.

Client IP Visibility Use Cases

Some use cases are better than others, and they have different requirements.

  • Marketing Data You want to estimate the number of unique people hitting your site, where they're from using a geolocation database, and which pages they're visiting. This is all useful information because marketing is a soft science so a little slop in the results isn't going to change the net effect. More importantly, it's also not something that needs to be done in realtime so often the best scenario here is not knowing the client IP directly, but using log files and scripts to extract the data you want.
  • Troubleshooting A customer calls up and says they have a problem with your site. You need to trace back where the issue is and for that you need an IP address. Here again logs are probably sufficient, but this is edging into the real-time category since troubleshooting with a customer on the phone doesn't lend itself well to looking for an IP in your SSL termination box so you can correlate it with a connection on the application server.
  • Attack Mitigation Let's say you're under attack and you want to block certain IP addresses. This is a perfect use case where you need to know the client's actual IP address immediately upon connection establishment so you can mitigate the attack. However, there are some risks here. First is around proxies and SuperNATs: you may have both attackers and legitimate users behind the same IP address. Under duress this is often considered to be an acceptable loss, but if possible architecting your application to use something at the application layer (like a cookie) to identify users is better. Second, you never want to block IP addresses permanently unless you control them. A home IP address's lifespan is on the order of months if they have an IPv4 address; if they're behind a SuperNAT with session persistence it may be hours. Eventually you'll be blocking legitimate users and you have to ask yourself how many of them will call the helpdesk to get unblocked and how many will go to your competitors instead.
  • Login Validation The theory goes that if you log in from address 2001:1234:5678::abcd that address will never change. This is not a good assumption to make by a long shot and if you want to know why you should check out the preceding section.

Why the Source IP isn't Enough

In most cases it is, but sometimes things get put in place that cause you to hide the effective IP address of a significant portion of your customers.

  • SSL Termination Boxes You've got a complicated application, it's got ten different hostnames, runs over SSL, and it's one of a hundred apps in your datacenter. You asked your ISP for another /24 and they told you to get in line. What do you do? Well, until XP finishes dying its slow and painful death and you can rely on your userbase to have SNI you need SAN or Wildcard certs. The former allow you to have a single cert with a canonical list of hostnames associated with it, the latter allows you to have any host under a domain covered by the same cert and is a little more difficult to pull off. Both of these can terminate on a single SSL termination device, either using your favorite web server as a proxy or buying a hardware appliance with more efficient decryption. Now your 10 hostnames can have RFC1918 addresses and you only need to expose one IP address to the world. But all of your connections are coming from your SSL termination box at 10.2.3.4.
  • Content Delivery Networks You may use a CDN that onboards customers using DNS. Here the TCP connection is between the end user and the CDN and the CDN makes a new TCP connection from itself to your origin. This again obscures the client's actual IP address.
  • TCP-Layer/IP-Layer Proxies, Scrubbers and Load Balancers Some carriers or third parties offer services that terminate at the IP or TCP layers for security or performance. These services could be architected invisibly and allow for passing the packets through with the source IP intact, but services that terminate at these layers often provide better functionality at the expense of hiding the source IP address.

Getting the Actual Client IP

Now for the meat of the article: how do we extract the client IP address? By now I'm assuming you know that the client's IP address is not the source IP address and that you've validated that the reason why you need the client IP is actually valid. Here's how to get it done.

X-Forwarded-For Header

In many of the use cases it's pretty easy to get the client IP address. If you've got a web-based service and have a CDN or SSL termination box worth its salt there's this magic thing called an X-Forwarded-For or XFF header. This is an HTTP header that a proxy of any type MAY insert into the HTTP overhead. The MAY is important, an anonymous proxy will not do this (but many public proxies outside of your control do). More importantly, there is no validation here - the XFF header isn't signed by the proxy using some cert registered at a CA. If the XFF header exists, you can trust that it will likely be valid for legitimate users but you should be careful about relying on it too heavily for malicious users. Many CDNs will offer some mechanism to validate that they, in fact, added the XFF header - you should contact them to make sure about that. Of course, if your own box is inserting the XFF header you can also set your own rules about what to do when there's an XFF header coming in to you. In a trusting world the box should tack the source IP onto the end of the header and just send it to your server as an additional link in the proxy chain, but this implies that the original client IP wasn't spoofed. The only IP address you can really trust is the last proxy in this case and many implementations throw out pre-existing XFF headers and only rely on the source IP coming in.
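As a minimal sketch of the "only trust the last link" rule, assume your own proxy appends the source IP it saw to the end of the header. The right-most entry is then the one your own box vouches for, and everything to its left is whatever the client or upstream proxies claimed. The function name xff_last_hop and the sample addresses are made up for illustration:

```shell
#!/bin/bash

# Hypothetical helper: print the right-most entry of an X-Forwarded-For value.
# Assumption: your own proxy appended the source IP it saw as the last entry,
# so entries to the left of it are unverified.
xff_last_hop() {
  # Keep the text after the final comma, then strip spaces and tabs.
  printf '%s\n' "${1##*,}" | tr -d ' \t'
}

# The first two entries here could have been forged by the client.
xff_last_hop "10.9.8.7, 198.51.100.23, 203.0.113.7"
```

If the header holds a single address with no commas the function returns it unchanged.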

Cookies and Beacons

I won't describe this in too much detail here because it will depend a lot on what your application looks like, but I can describe some high-level strategies. Let's assume that you're intentionally putting something between your end users and your server. Now let's assume that you can access the server directly as well (or at least another server that can communicate with your application). If a user connects to you at www.your-domain.com via this proxy you can generate on the server-side a unique identifier. You can then use JavaScript to send that identifier to www-direct.your-domain.com. By remembering the source IP and port that connected to your server for identifier 1234 you can correlate this session to the IP address that hit you on www-direct.your-domain.com. You can also do a similar thing with cookies which have the benefit of allowing you to track across the entire session. You can also do this by simply hitting the www-direct host on every page and downloading a "pixel" - a tiny object previously just a single pixel image that never gets displayed. Now the bulk of your traffic is going through the proxy but this tiny amount of data is being passed through a direct mechanism to track the user's IP address - problem solved!

As long as the user doesn't disable Javascript, cookies, or use something other than a traditional browser to view your page. In many cases this would be a show-stopper for your application anyway, if you can't keep track of a session cookie across two hostnames you own you're probably not able to keep that user logged in. More importantly, what if you're not running a web server? It's hard to tell ssh to connect to one IP address but also toss a few packets towards another just so we can correlate the connections together.
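For the web case, the correlation step described above can be sketched roughly as follows. Everything here is hypothetical: the two log formats, the tokens, and the addresses are invented for illustration, and a real application would keep this mapping in whatever store it already uses for sessions.

```shell
#!/bin/bash

# Hypothetical log formats, one record per line:
#   proxied_log: "<identifier> <apparent source IP seen through the proxy>"
#   beacon_log:  "<identifier> <IP seen directly on www-direct>"
proxied_log='1234 10.2.3.4
1235 10.2.3.4'
beacon_log='1235 198.51.100.23
1234 203.0.113.7'

# Look up the directly observed IP for one identifier.
client_ip_for_token() {
  echo "$beacon_log" | awk -v token="$1" '$1 == token { print $2 }'
}

# Print "<identifier> <apparent IP> <client IP>" for every proxied session.
correlate() {
  local token apparent
  while read -r token apparent; do
    echo "$token $apparent $(client_ip_for_token "$token")"
  done <<< "$proxied_log"
}

correlate
```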

draft-williams-exp-tcp-host-id-opt

The previous two examples work well for HTTP-based services passing through SSL termination boxes or CDNs that terminate at the HTTP layer. But what about TCP-layer scrubbers? Or if you don't want to share your certs with a CDN and can only accelerate at the TCP layer? Or if you don't have an HTTP-based app at all? This is where something like draft-williams-exp-tcp-host-id-opt comes into play. It utilizes an experimental option space in the TCP overhead to pass the client IP address.

Support for this is a bit tricky. If you are using a hardware SSL termination device, F5 has a tutorial on how to access the TCP option space using Akamai's IPA implementation as an example. However, by the time the connection reaches Apache the TCP layer has been terminated and the options involved are lost so without some custom kernel work or an F5 device it may be a bit hard to test that this is working. Luckily, tcpdump works just fine to make sure you're seeing what you should be seeing.

Using tcpdump to Validate draft-williams-exp-tcp-host-id-opt

So, you've got a spiffy new TCP option appearing in your server's connections and you want to make sure that it's correct. The first thing you'll need is a known IP address, your home address is usually pretty reasonable for this and if it's not then you can try from your phone or from any nearby public WiFi hotspot. You can figure out what your IP address is by simply asking Google "what is my ip address?".

Now, on your server you'll want to run tcpdump to capture packets.

tcpdump -i eth0 '(tcp dst port 80 or tcp dst port 443) and tcp[tcpflags] & tcp-syn != 0' -w overlay_test.pcap

This is assuming that you're running a web server on interface eth0 using ports 80 and 443. I only capture the SYN packets because the standard only really cares about the beginning of the connection so the IP address doesn't take up valuable option space real estate. I'm also writing this to a file since it's a lot easier to explore it offline (we'll do this in realtime later). Now make a few connections to the server through your proxy and shut down the tcpdump.

We can take a look at the tcpdump using:

tcpdump -x -n -r overlay_test.pcap

We want to show the raw data of the header (-x) and we don't care about the rDNS lookups on the IPs (-n). Now, it's important to note that different flavors of tcpdump may format output differently so your mileage may vary. The tcpdump output below is from Ubuntu Trusty.

20:40:24.901308 IP 209.139.35.113.56165 > 172.16.13.80.80: Flags [S], seq 1086200143, win 8208, options [mss 1368,sackOK,TS val 1671783595 ecr 0,nop,wscale 7,exp-0348], length 0
	0x0000:  4500 0044 edfa 0000 3606 e85c d18b 2371
	0x0010:  ac10 0d50 db65 0050 40be 194f 0000 0000
	0x0020:  c002 2010 ca34 0000 0204 0558 0402 080a
	0x0030:  63a5 64ab 0000 0000 0103 0307 fd08 0348
	0x0040:  36f6 fa0f

The important thing we're looking for is the "[exp-0348]" which means that the appropriate experimental option number is there. Checking the last bytes of the header we can break this down as follows:

  • fd Kind 253, essentially this is a special TCP option that says "see the first field after the length to figure out what option I am." Note that "fe" or 254 is also a valid value for this position. In general implementations will only use one or the other, but it's a good use case for making the value a variable.
  • 08 The option is a total of 8 bytes (kind, length, the two-byte option number, and four bytes of IPv4 address).
  • 0348 This is experimental option 0348 which has been assigned to this draft.
  • 36f6 fa0f This is the IP address which translates to 0x36.0xf6.0xfa.0x0f or 54.246.250.15. You can also convert this using an online hex-to-IP converter.
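If you'd rather not leave the terminal, the same conversion is a one-liner with bash's printf, which accepts 0x-prefixed values for %d. This is only a sketch and the function name is illustrative:

```shell
#!/bin/bash

# Convert 8 hex characters into a dotted quad, one byte (two characters) at a time.
hex_to_quad() {
  local hex=$1
  printf '%d.%d.%d.%d\n' "0x${hex:0:2}" "0x${hex:2:2}" "0x${hex:4:2}" "0x${hex:6:2}"
}

hex_to_quad 36f6fa0f   # prints 54.246.250.15
```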

If you've got an IPv6 address, the only delta is the length:

20:10:22.947083 IP 69.31.88.242.59732 > 172.16.13.80.80: Flags [S], seq 2155631820, win 8064, options [mss 1344,sackOK,TS val 1669981658 ecr 0,nop,wscale 7,exp-0348], length 0
	0x0000:  4500 0050 af95 0000 3106 82a1 451f 58f2
	0x0010:  ac10 0d50 e954 0050 807c 54cc 0000 0000
	0x0020:  f002 1f80 020b 0000 0204 0540 0402 080a
	0x0030:  6389 e5da 0000 0000 0103 0307 fd14 0348
	0x0040:  2001 0470 1f07 0a86 5cd4 3ecd 7940 12d4

We're also using option 253/0xfd here so this is the segment we're looking at fd14 0348 2001 0470 1f07 0a86 5cd4 3ecd 7940 12d4.

  • fd Kind 253.
  • 14 The option is a total of 0x14 or 20 bytes.
  • 0348 This is experimental option 0348 which has been assigned to this draft.
  • 2001 0470 1f07 0a86 5cd4 3ecd 7940 12d4 This is the IP address which is much easier to translate to 2001:0470:1f07:0a86:5cd4:3ecd:7940:12d4.
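The IPv6 translation is just a matter of dropping a colon after every four hex characters. A small sketch of that, with an illustrative function name:

```shell
#!/bin/bash

# Split 32 hex characters into eight colon-separated groups of four.
hex_to_ipv6() {
  # Append a colon after every group of four, then drop the trailing colon.
  echo "$1" | sed -e 's/..../&:/g' -e 's/:$//'
}

hex_to_ipv6 200104701f070a865cd43ecd794012d4   # prints 2001:0470:1f07:0a86:5cd4:3ecd:7940:12d4
```

This gives the expanded form only; it makes no attempt to compress runs of zeros.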

Note that if you want to see other variants of the source IP, you can install sipcalc:

$ sipcalc 2001:0470:1f07:0a86:5805:3b73:5650:517e
-[ipv6 : 2001:0470:1f07:0a86:5805:3b73:5650:517e] - 0

[IPV6 INFO]
Expanded Address	- 2001:0470:1f07:0a86:5805:3b73:5650:517e
Compressed address	- 2001:470:1f07:a86:5805:3b73:5650:517e
Subnet prefix (masked)	- 2001:470:1f07:a86:5805:3b73:5650:517e/128
Address ID (masked)	- 0:0:0:0:0:0:0:0/128
Prefix address		- ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff
Prefix length		- 128
Address type		- Aggregatable Global Unicast Addresses
Network range		- 2001:0470:1f07:0a86:5805:3b73:5650:517e -
			  2001:0470:1f07:0a86:5805:3b73:5650:517e

-

Note, if you see a bunch of nop entities at the end of the packet like this, you probably had something strip out the option. You should walk up the path to see if you can identify the firewall responsible:

20:40:24.901308 IP 209.139.35.113.56165 > 172.16.13.80.80: Flags [S], seq 1086200143, win 8208, options [mss 1368,sackOK,TS val 1671783595 ecr 0,nop,wscale 7,exp-0348], length 0
	0x0000:  4500 0044 edfa 0000 3606 e85c d18b 2371
	0x0010:  ac10 0d50 db65 0050 40be 194f 0000 0000
	0x0020:  c002 2010 ca34 0000 0204 0558 0402 080a
	0x0030:  63a5 64ab 0000 0000 0103 0307 0000 0000
	0x0040:  0000 0000

draft-williams-overlaypath-ip-tcp-rfc (deprecated)

This is an older spec that used a hijacked non-experimental TCP option. There are a couple of options here, but really only two make sense. We want information at the IP layer so we want to use overhead at the IP layer to obtain this information. This works great for IPv6 connections between the proxy/CDN and your server so it's the way you should go in the long run, but what if your server only supports IPv4? The problem with embedding the client IP into the IPv4 option space is that traditionally routers don't like IPv4 options and will often drop packets containing them. If you own the infrastructure you can ensure this isn't the case, but if you have to pass the data over the Internet or through third-party routers the IP option space isn't a viable plan. For IPv4 connections to your server you're pretty much stuck with the TCP option space. Example output from the same tcpdump we used above looks like this:

08:55:30.121413 IP 184.86.99.115.56084 > 72.246.44.148.443: Flags [S], seq 3228531619, win 8208, options [mss 1368,sackOK,TS val 157815181 ecr 0,nop,wscale 7,nop,uto0x136 155[len 7],[bad opt]>
	0x0000:  4500 0044 6cf4 0000 3606 866c b856 6373
	0x0010:  48f6 2c94 db14 01bb c06f 7ba3 0000 0000
	0x0020:  c002 2010 09f5 0000 0204 0558 0402 080a
	0x0030:  0968 118d 0000 0000 0103 0307 011c 0701
	0x0040:  36f6 fa0f

The important thing we're looking for is the "[bad opt]" which means that there is an option that tcpdump doesn't know about or is incorrect - tcpdump doesn't support the draft because it's not an RFC yet and no option number has been assigned. This also means that we need to make sure that we know what option to expect, in this case we're using the seldom-in-production option 28. 28 translates to 0x1c in hex so we want 1c 0701 36f6 fa0f. We can break this down as follows:

  • 1c Option 28, this should definitely be defined as a variable since there is no standard option number for this and it must be a hijacked value.
  • 07 We want a total of 7 bytes
  • 01 This is version 1. Note that this could also be represented as 2, which would be binary 000 = IPv4 and 00010 = version 2 but most implementations just use v1 for IPv4 and v2 for IPv6.
  • 36f6fa0f This is the IP address which translates to 0x36.0xf6.0xfa.0x0f or 54.246.250.15. You can also convert this using an online hex-to-IP converter.

This works well for IPv4 sources, but what if you've got an IPv6 client connecting to your proxy but an IPv4 connection to your server? That's where v2 comes into play.

09:03:31.969233 IP 204.245.143.115.32086 > 72.246.44.148.80: Flags [S], seq 1885142044, win 8064, options [mss 1344,sackOK,TS val 158296999 ecr 0,nop,wscale 7,nop,uto0x2220 4368[len 19],nop,[bad opt]>
        0x0000:  4500 0050 d849 0000 3906 d76b ccf5 8f73
        0x0010:  48f6 2c94 7d56 0050 705c fc1c 0000 0000
        0x0020:  f002 1f80 0a33 0000 0204 0540 0402 080a
        0x0030:  096f 6ba7 0000 0000 0103 0307 011c 1322
        0x0040:  2001 0470 1f07 0a86 5805 3b73 5650 517e

We're also using option 28/0x1c here so this is the segment we're looking at 1c 1322 2001 0470 1f07 0a86 5805 3b73 5650 517e.

  • 1c Option 28
  • 13 It's 0x13 or 19 bytes long.
  • 22 This is actually two fields, three bits followed by five bits 001 00010. A 1 in the first three bits means it's IPv6 and a 2 in the second five means it's version 2 of the draft implementation.
  • 2001 0470 1f07 0a86 5805 3b73 5650 517e Translating IPv6 addresses is easy. 2001:0470:1f07:0a86:5805:3b73:5650:517e
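The split of that third byte into three bits and five bits is easy to get wrong by eye, so here is a sketch using shell arithmetic. The function name and output format are illustrative; the meaning of the fields is as described in the breakdowns above.

```shell
#!/bin/bash

# Split the third byte of the legacy option into its two fields:
# the top three bits and the low five bits.
decode_legacy_byte() {
  local byte=$(( 0x$1 ))
  local family=$(( (byte >> 5) & 0x7 ))   # 0 = IPv4, 1 = IPv6 in the examples above
  local version=$(( byte & 0x1f ))        # draft implementation version
  echo "family=$family version=$version"
}

decode_legacy_byte 22   # prints family=1 version=2
decode_legacy_byte 01   # prints family=0 version=1
```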

Getting Data in Real Time

First of all, the best way to do this is using an appliance that supports this standard. For web services a great solution is to extract the client IP and insert it into the XFF header (assuming the appliance terminates the SSL as well). But let's say you can't do that for whatever reason. We can leverage tcpdump to create a script that will take an input of the apparent source IP/port and produce an output of the actual client IP address. This is most certainly not something that is designed for a heavy-use server but more for a proof of concept since there's a lot of work that gets done whenever you get a new connection - I run this in a VM that only gets a handful of connections per second.

Ideally I would parse each value in the header explicitly. Instead, I match on strings: I look for a SYN packet that has the source IP and port in the right spots and also contains the draft's option header. The odds that some other captured SYN matches both strings and arrives after the SYN I actually care about are low. This also makes things faster, since grep handles the substring searches a lot faster than I could parse a packet in a shell script.

The first step is to start capturing packets. To keep this up most of the time I put this script into a crontab - make sure you run it as an appropriate user since tcpdump's output tends to be very strictly permissioned:

#!/bin/bash
 
if [ "$(ps -e |grep tcpdump)" = "" ]; then
  outfile=/tmp/overlay.pcap
  tcpdump -i eth0 '(tcp dst port 80 or tcp dst port 443) and tcp[tcpflags] &  tcp-syn != 0' -U -W 2 -C 1 -w ${outfile} &
fi

This checks to see if tcpdump is already running. If it isn't, it creates /tmp/overlay.pcap0 and /tmp/overlay.pcap1. It will capture packets on eth0 destined for ports 80 and 443 (change this to your service, the dst also precludes SYN/ACK packets), ignore anything but SYNs, write to the file after each packet instead of buffering (-U), create two files (-W 2), and limit each file to 1MB (-C 1). You may need to play with the -C value, but the premise is that I can capture 2MB of data in two files.

Now I need to parse the data. For this I have a grab_client_ip script that I run:

#!/bin/bash
 
# This should be the filename we specify in the tcpdump cron
filebase=/tmp/overlay.pcap
 
function hex2quad {
  # Convert the hex version of an IP address into dotted quad notation.
  # This could be replaced with sipcalc
  full_string=$(echo $1 | tr '[a-z]' '[A-Z]')
  if [ ! "$full_string" = "" ]; then
    first_octet=$(echo 'ibase=16;obase=A;'$(echo ${full_string:0:2}) | bc)
    second_octet=$(echo 'ibase=16;obase=A;'$(echo ${full_string:2:2}) | bc)
    third_octet=$(echo 'ibase=16;obase=A;'$(echo ${full_string:4:2}) | bc)
    forth_octet=$(echo 'ibase=16;obase=A;'$(echo ${full_string:6:2}) | bc)
    echo $first_octet.$second_octet.$third_octet.$forth_octet
  fi
}
 
function quad2hex {
  # Convert the dotted quad version of an IP address into hex.
  # This could be replaced with sipcalc
  full_string=$1
  if [ ! "$full_string" = "" ]; then
    first_octet=$(echo 'ibase=A;obase=16;'$(echo $full_string | awk -F\. '{print $1}') | bc)
    second_octet=$(echo 'ibase=A;obase=16;'$(echo $full_string | awk -F\. '{print $2}') | bc)
    third_octet=$(echo 'ibase=A;obase=16;'$(echo $full_string | awk -F\. '{print $3}') | bc)
    forth_octet=$(echo 'ibase=A;obase=16;'$(echo $full_string | awk -F\. '{print $4}') | bc)
    printf "%02x%02x%02x%02x" 0x$first_octet 0x$second_octet 0x$third_octet 0x$forth_octet | tr '[A-Z]' '[a-z]'
  fi
}
 
function dec2hex {
  # Convert a decimal integer into hex.
  full_string=$1
  hex_value=$(echo 'ibase=A;obase=16;'$full_string | bc | tr '[A-Z]' '[a-z]')
  printf "%04x" 0x$hex_value
}
 
# These are the apparent source IP address and port that I see in the tcpdump
source_ip_quad=$1
source_port_dec=$2
source_ip_hex=$(quad2hex $source_ip_quad)
source_port_hex=$(dec2hex $source_port_dec)
 
# The option number should be a variable. Here I'm using 253 (0xfd), the experimental kind; the deprecated draft used a hijacked option such as 28 (0x1c).
option_num=fd
 
# First construct a list of potential packets by filtering out the 
# source IP and port we see at the origin
packet_list=""
for filename in ${filebase}0 ${filebase}1; do
  packet_list="$packet_list
$(sudo tcpdump -x -n -r $filename 2> /tmp/client_ip.log \
    | awk '{ if ( $1 ~ /0x[0-9a-f]*:/ ) { printf("%s%s%s%s%s%s%s%s", $2, $3, $4, $5, $6, $7, $8, $9); } else { printf("\n"); } }' \
    | grep -E "^[0-9,a-f]{24}$source_ip_hex[0-9,a-f]{8,}$source_port_hex")"
done
 
  # Check for IPv4 addresses
  potential_ips="$potential_ips
$(hex2quad $(echo "$packet_list" \
  | grep -o -P ${option_num}'080348.{8}' \
  | sed s/${option_num}080348//))"
  # See below for a breakdown of this
 
  # Check for IPv6 addresses
  potential_ips="$potential_ips
$(sipcalc $(echo "$packet_list" \
  | grep -o -P ${option_num}'140348.{32}' \
  | sed s/${option_num}140348// \
  | sed -e :a -e 's/\(.*[0-9a-f]\)\([0-9a-f]\{4\}\)/\1:\2/;ta') \
  | grep Compressed | awk '{print $4}')"
  # See below for a breakdown of this
 
# There's an off chance I've got more than one so strip out any empty lines and pick the last one.
potential_ips=$(echo "$potential_ips" | grep . | tail -n1)
 
if [ "$potential_ips" = "" ]; then
  echo "IP not found"
  exit 1
else
  echo $potential_ips
fi

This isn't the cleanest implementation, but it gives you some simple options on how to look for this in a more automated fashion in what I hope is something that's easy to follow so you can rewrite it into the language of your choice. What you do with this data is, of course, up to you. You can log the date/time of the TCP connection with a mapping to the actual client IP address, you can rewrite your Apache/sshd/etc. logs with the actual client IP, or you can just present it to the end user like I do at diag.ebower.com using the following:

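<?php 
  # Depending on the nature of the webserver you may need to pause a short while to ensure tcpdump has written the packet.
  sleep(1);
  # Quoted array keys can't be interpolated directly inside a double-quoted string, so build the command with escaped arguments.
  $cmd = "/usr/local/bin/grab_client_ip " . escapeshellarg($_SERVER['REMOTE_ADDR']) . " " . escapeshellarg($_SERVER['REMOTE_PORT']);
  # system() echoes the script's output to the page and returns its last line.
  $overlay_ip = system($cmd);
?>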

Packet Filtering

First I try to remove the unwanted packets, but there are several steps I can break down:

$(sudo tcpdump -x -n -r $filename 2> /tmp/client_ip.log \
    | awk '{ if ( $1 ~ /0x[0-9a-f]*:/ ) { printf("%s%s%s%s%s%s%s%s", $2, $3, $4, $5, $6, $7, $8, $9); } else { printf("\n"); } }' \
    | grep -E "^[0-9,a-f]{24}$source_ip_hex[0-9,a-f]{8,}$source_port_hex")"

First, we'll start with the tcpdump:

tcpdump -x -n -r $filename 2> /tmp/client_ip.log

We shunt STDERR to another file and we get a long output that we'll shorten to a few packets:

06:35:51.086344 IP 69.31.88.250.53374 > 172.16.13.80.443: Flags [S], seq 2699099048, win 8208, options [mss 1368,sackOK,TS val 1707509419 ecr 0,nop,wscale 7,exp-0348], length 0
	0x0000:  4500 0044 4a82 0000 3106 e7b8 451f 58fa
	0x0010:  ac10 0d50 d07e 01bb a0e0 fba8 0000 0000
	0x0020:  c002 2010 4d65 0000 0204 0558 0402 080a
	0x0030:  65c6 86ab 0000 0000 0103 0307 fd08 0348
	0x0040:  2625 790a
06:36:10.953027 IP 184.86.99.115.28778 > 172.16.13.80.443: Flags [S], seq 808304792, win 3840, options [mss 1440], length 0
	0x0000:  4500 002c aa1f 0000 3706 0483 b856 6373
	0x0010:  ac10 0d50 706a 01bb 302d c098 0000 0000
	0x0020:  6002 0f00 5125 0000 0204 05a0
06:36:11.221241 IP 209.139.35.119.4360 > 172.16.13.80.443: Flags [S], seq 278360426, win 8208, options [mss 1368,sackOK,TS val 1707529554 ecr 0,nop,wscale 7,exp-0348], length 0
	0x0000:  4500 0044 8415 0000 3606 523c d18b 2377
	0x0010:  ac10 0d50 1108 01bb 1097 716a 0000 0000
	0x0020:  c002 2010 81d3 0000 0204 0558 0402 080a
	0x0030:  65c6 d552 0000 0000 0103 0307 fd08 0348
	0x0040:  8dd4 573e
06:36:12.950483 IP 69.31.21.237.29263 > 172.16.13.80.443: Flags [S], seq 283981422, win 3840, options [mss 1440], length 0
	0x0000:  4500 002c 0e3a 0000 3606 6226 451f 15ed
	0x0010:  ac10 0d50 724f 01bb 10ed 366e 0000 0000
	0x0020:  6002 0f00 b968 0000 0204 05a0
06:36:13.250722 IP 65.114.164.241.31412 > 172.16.13.80.443: Flags [S], seq 817012563, win 3840, options [mss 1440], length 0
	0x0000:  4500 002c 39f3 0000 3506 ac15 4172 a4f1
	0x0010:  ac10 0d50 7ab4 01bb 30b2 9f53 0000 0000
	0x0020:  6002 0f00 9d01 0000 0204 05a0

We then pipe this through awk so we just end up with the hex encoding of the header:

awk '{ if ( $1 ~ /0x[0-9a-f]*:/ ) { printf("%s%s%s%s%s%s%s%s", $2, $3, $4, $5, $6, $7, $8, $9); } else { printf("\n"); } }'

This will take the output above, look for a 0xnnnn:, and put the hex data in the header into a single line per packet:

450000444a8200003106e7b8451f58faac100d50d07e01bba0e0fba800000000c00220104d650000020405580402080a65c686ab0000000001030307fd0803482625790a
4500002caa1f000037060483b8566373ac100d50706a01bb302dc0980000000060020f0051250000020405a0   
45000044841500003606523cd18b2377ac100d50110801bb1097716a00000000c002201081d30000020405580402080a65c6d5520000000001030307fd0803488dd4573e       
4500002c0e3a000036066226451f15edac100d50724f01bb10ed366e0000000060020f00b9680000020405a0   
4500002c39f300003506ac154172a4f1ac100d507ab401bb30b29f530000000060020f009d010000020405a0

Now we only take the lines that we suspect have the appropriate apparent source IP and port - this should be just one in most cases. Note that the source IP starts at character 25, then there are 8 characters for the destination IP. Sometimes there may be extra characters for IP options, so the pattern accepts 8 or more characters before the source port rather than exactly 8, which makes the match a little loose. If I wanted to be precise, I should look at the length field and use that instead:

grep -E "^[0-9,a-f]{24}$source_ip_hex[0-9,a-f]{8,}$source_port_hex"

Output:

45000044841500003606523cd18b2377ac100d50110801bb1097716a00000000c002201081d30000020405580402080a65c6d5520000000001030307fd0803488dd4573e
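For what it's worth, being precise about the length field isn't much code. The sketch below reads the IP header length from the low nibble of the first byte and uses it to find where the TCP header (and therefore the source port) starts. It assumes an IPv4 packet flattened to a single hex string as above, and the function name is illustrative:

```shell
#!/bin/bash

# Print the source IP and source port (both in hex) from a flattened IPv4 packet,
# using the IP header length field to locate the start of the TCP header.
src_ip_and_port() {
  local pkt=$1
  local ihl_words=$(( 0x${pkt:1:1} ))        # header length in 32-bit words
  local tcp_start=$(( ihl_words * 4 * 2 ))   # words -> bytes -> hex characters
  echo "${pkt:24:8} ${pkt:tcp_start:4}"
}

pkt=45000044841500003606523cd18b2377ac100d50110801bb1097716a00000000c002201081d30000020405580402080a65c6d5520000000001030307fd0803488dd4573e
src_ip_and_port "$pkt"   # prints d18b2377 1108
```

Here d18b2377 is 209.139.35.119 and 0x1108 is port 4360, matching the tcpdump summary line for that packet.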

IPv4 Extraction Details

I take the output from the above and I assume that there's an IPv4 address embedded first:

hex2quad $(echo "$packet_list" \
  | grep -o -P ${option_num}'080348.{8}' \
  | sed s/${option_num}080348//)


I make sure that the option number exists in the string (again, I should properly parse it) and I return just this and the next 8 characters:

grep -o -P ${option_num}'080348.{8}'

Output:

fd0803488dd4573e

This final sed strips out the option and prints the IP address:

sed s/${option_num}080348//

Output:

8dd4573e

And we pass this through the hex2quad function to get the dotted IP notation. Output:

141.212.87.62
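If you want to parse the options properly rather than grep for a substring, a rough sketch of walking the TCP option list looks like this. It assumes an IPv4 packet flattened to one lowercase hex string, it does no bounds checking on truncated captures, and the function and variable names are my own:

```shell
#!/bin/bash

# Walk the TCP options of an IPv4 packet (given as one hex string) and print
# the address carried in experimental option kind fd with option number 0348.
find_host_id() {
  local pkt=$1
  local ip_len=$(( 0x${pkt:1:1} * 4 ))               # IP header length in bytes
  local tcp_len=$(( 0x${pkt:ip_len*2+24:1} * 4 ))    # TCP header length in bytes
  local pos=$(( (ip_len + 20) * 2 ))                 # first option, in hex characters
  local end=$(( (ip_len + tcp_len) * 2 ))            # end of the TCP header
  local kind len addr_chars
  while [ "$pos" -lt "$end" ]; do
    kind=${pkt:pos:2}
    if [ "$kind" = 00 ]; then break; fi              # end of option list
    if [ "$kind" = 01 ]; then pos=$(( pos + 2 )); continue; fi   # nop is one byte
    len=$(( 0x${pkt:pos+2:2} ))
    if [ "$len" -lt 2 ]; then break; fi              # malformed length, stop
    if [ "$kind" = fd ] && [ "$len" -ge 4 ] && [ "${pkt:pos+4:4}" = 0348 ]; then
      addr_chars=$(( (len - 4) * 2 ))                # 8 for IPv4, 32 for IPv6
      echo "${pkt:pos+8:addr_chars}"
      return 0
    fi
    pos=$(( pos + len * 2 ))
  done
  return 1
}

pkt=45000044841500003606523cd18b2377ac100d50110801bb1097716a00000000c002201081d30000020405580402080a65c6d5520000000001030307fd0803488dd4573e
find_host_id "$pkt"   # prints 8dd4573e
```

Because it steps over each option by its length, a stray fd080348 inside a timestamp or sequence number can't produce a false match the way the grep can.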

IPv6 Extraction Details

This is nearly identical to above, but I assume an IPv6 address.

sipcalc $(echo "$packet_list" \
  | grep -o -P ${option_num}'140348.{32}' \
  | sed s/${option_num}140348// \
  | sed -e :a -e 's/\(.*[0-9a-f]\)\([0-9a-f]\{4\}\)/\1:\2/;ta') \
  | grep Compressed | awk '{print $4}'

Instead of looking for a length of 0x08 I look for 0x14 and grab the 32 characters after it:

grep -o -P ${option_num}'140348.{32}'

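Given an IPv6-sourced packet like the one below, the grep returns only the option header and the 32 characters of address after it, and the first sed strips the option header: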

450000507c520000360659ffd18b236bac100d5014da0050107b80d700000000f0021f80205f0000020405400402080a63be88dd0000000001030307fd140348200104701f070a865cd43ecd794012d4

I use sed to add the colons to the IP to turn it into an expanded format IPv6 address:

sed -e :a -e 's/\(.*[0-9a-f]\)\([0-9a-f]\{4\}\)/\1:\2/;ta'

Output:

2001:0470:1f07:0a86:5cd4:3ecd:7940:12d4

I punt on trying to format it and use sipcalc instead:

-[ipv6 : 2001:0470:1f07:0a86:5cd4:3ecd:7940:12d4] - 0

[IPV6 INFO]
Expanded Address	- 2001:0470:1f07:0a86:5cd4:3ecd:7940:12d4
Compressed address	- 2001:470:1f07:a86:5cd4:3ecd:7940:12d4
Subnet prefix (masked)	- 2001:470:1f07:a86:5cd4:3ecd:7940:12d4/128
Address ID (masked)	- 0:0:0:0:0:0:0:0/128
Prefix address		- ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff
Prefix length		- 128
Address type		- Aggregatable Global Unicast Addresses
Network range		- 2001:0470:1f07:0a86:5cd4:3ecd:7940:12d4 -
			  2001:0470:1f07:0a86:5cd4:3ecd:7940:12d4

-

Now I find the compressed address:

grep Compressed

Output:

Compressed address	- 2001:470:1f07:a86:5cd4:3ecd:7940:12d4

And finally print the IPv6 address:

awk '{print $4}'

Output:

2001:470:1f07:a86:5cd4:3ecd:7940:12d4

Using the Legacy Option

If you're using the now-deprecated draft-williams-overlaypath-ip-tcp-rfc you only need to change the option number and the length and version bytes that follow it. Often you'll be seeing a hijacked option 28 (0x1c); in that case change the option number and the grep and sed that include it to this for IPv4:

option_num=1c
...
grep -o -P ${option_num}'0701.{8}'
...
sed s/${option_num}0701//

For IPv6 the change is:

option_num=1c
...
grep -o -P ${option_num}'1322.{32}'
...
sed s/${option_num}1322//
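As a quick sanity check, here is a sketch of the IPv4 case run against the legacy capture shown earlier in the article, flattened to one line. It uses grep -o -E rather than -P since the pattern needs nothing Perl-specific; the function name is illustrative.

```shell
#!/bin/bash

# Hex of the legacy-draft IPv4 SYN from the earlier tcpdump output, on one line.
pkt=450000446cf400003606866cb856637348f62c94db1401bbc06f7ba300000000c002201009f50000020405580402080a0968118d0000000001030307011c070136f6fa0f

option_num=1c

# Keep the option header plus the 8 hex characters of address, then drop the header.
legacy_ipv4_hex() {
  echo "$1" | grep -o -E "${option_num}0701.{8}" | sed "s/${option_num}0701//"
}

legacy_ipv4_hex "$pkt"   # prints 36f6fa0f
```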