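While running a Linux proxy traffic replay experiment, a mistake in my ip route configuration made the experiment fail. I spent a week troubleshooting and finally spotted the mistake while reading the blog post Inline on a Linux router.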

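Proxy traffic replay experiment

Inside the namespace server, policy routing steers the IP packets sent by the client program, whatever their destination IP address and TCP port, to the server program. The policy routing configuration in the namespace server is as follows: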

iptables -t mangle -N DIVERT
iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT
iptables -t mangle -A DIVERT -j MARK --set-mark 1
iptables -t mangle -A DIVERT -j ACCEPT
iptables -t mangle -A PREROUTING -p tcp -j TPROXY --tproxy-mark 0x1/0x1 --on-port ${LISTEN_PORT}
ip rule add fwmark 1 lookup 100
ip route add default dev ${VETH_SERVER_INNER} scope host table 100

The last command, the ip route one, is wrong. The rest of this post describes how I tracked the problem down.
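Before digging into the kernel, it can help to look at what the commands above actually installed. The following is only a sketch, not part of the original experiment: the function name show_tproxy_state is made up for illustration, and it assumes you run it as root inside the server namespace with table 100 as configured above.

```shell
# Sketch: dump the policy-routing state that the TPROXY setup relies on.
# show_tproxy_state is a hypothetical helper name; table 100 and the DIVERT
# chain come from the configuration shown above.
show_tproxy_state() {
    # Rules that select a routing table by firewall mark
    ip rule show
    # Routes in the table that marked packets are looked up in
    ip route show table 100
    # Per-rule packet counters for the two mangle chains involved
    iptables -t mangle -L PREROUTING -n -v
    iptables -t mangle -L DIVERT -n -v
}

# Usage (as root, inside the namespace):
#   show_tproxy_state
```

If the counters on the TPROXY rule increase while the server program never sees a connection, the packets are being marked correctly and the problem is more likely in the routing step than in iptables.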

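Learning how Linux receives packets

Study materials:

  1. linux TCP/IP协议栈-IP层 (the Linux TCP/IP stack: the IP layer)
  2. It’s crowded in here! - The Cloudflare Blog

The Cloudflare post shows the stages a packet passes through when Linux receives it: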

bpf inet_lookup

Since bpf came up, I decided to use bpf to investigate.

Troubleshooting with bpf

I ran the skbdrop.bt tool that performance expert Brendan Gregg provides in his book BPF Performance Tools. The output was:

bpftrace --unsafe skbdrop.bt
Attaching 3 probes...
Tracing unusual skb drop stacks. Hit Ctrl-C to end.
^C#kernel
IpInReceives                    17                 0.0
IpInDelivers                    17                 0.0
IpOutRequests                   22                 0.0
TcpInSegs                       17                 0.0
TcpOutSegs                      21                 0.0
TcpRetransSegs                  1                  0.0
TcpExtTCPHPHits                 1                  0.0
TcpExtTCPPureAcks               8                  0.0
TcpExtTCPHPAcks                 5                  0.0
TcpExtTCPTimeouts               1                  0.0
TcpExtTCPSynRetrans             1                  0.0
TcpExtTCPOrigDataSent           20                 0.0
TcpExtTCPDelivered              20                 0.0
TcpExtTcpTimeoutRehash          1                  0.0
IpExtInOctets                   880                0.0
IpExtOutOctets                  2239               0.0
IpExtInNoECTPkts                17                 0.0

@[
    kfree_skb+118
    kfree_skb+118
    ip_error+134
    ip_rcv_finish+135
    ip_rcv+188
    __netif_receive_skb_one_core+136
    __netif_receive_skb+24
    process_backlog+169
]: 2
@[
    kfree_skb+118
    kfree_skb+118
    unix_stream_connect+1919
    __sys_connect_file+95
    __sys_connect+161
    __x64_sys_connect+26
    do_syscall_64+73
    entry_SYSCALL_64_after_hwframe+68
]: 2

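I looked up the ip_rcv_finish source code for the kernel version reported by uname -r and found that ip_rcv_finish does not call ip_error directly. I did not know how to keep reading the code that follows ip_rcv_finish, so tracing the problem through the protocol stack came to a dead end.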
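One way around reading the source is to ask the kernel for its routing decision directly. The stack above ends in ip_error, which as far as I understand is reached through the input handler of the route chosen for the packet rather than through a direct call, so the routing lookup is the thing to inspect. The sketch below wraps ip route get; the function name probe_marked_route and the addresses in the usage line are placeholders, not values from the experiment.

```shell
# Sketch: ask the kernel how it would route an incoming packet that carries
# a given fwmark. probe_marked_route is a hypothetical helper name.
#   $1 = destination address of the packet
#   $2 = source address of the packet
#   $3 = interface the packet arrives on
#   $4 = fwmark (defaults to 1, the mark set by the DIVERT/TPROXY rules)
probe_marked_route() {
    dst="$1"
    src="$2"
    dev="$3"
    mark="${4:-1}"
    # "from ... iif ..." makes this an input-path lookup, and "mark" lets the
    # "fwmark 1 lookup 100" rule take part in it.
    ip route get "$dst" from "$src" iif "$dev" mark "$mark"
}

# Usage (inside the server namespace, addresses are examples only):
#   probe_marked_route 203.0.113.10 10.0.0.2 "$VETH_SERVER_INNER" 1
```

If the lookup returns an error, or a route that is not of type local, the marked packet will not be handed to a local socket, which would be consistent with the drops recorded under ip_error.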

Reading more TProxy blog posts

While reading the Inline on a Linux router post, one command caught my eye: ip route add local 0.0.0.0/0 dev lo table 1. Why is there a local in it?

I decided to add it first and ask questions later.

iptables -t mangle -N DIVERT
iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT
iptables -t mangle -A DIVERT -j MARK --set-mark 1
iptables -t mangle -A DIVERT -j ACCEPT
iptables -t mangle -A PREROUTING -p tcp -j TPROXY --tproxy-mark 0x1/0x1 --on-port ${LISTEN_PORT}
ip rule add fwmark 1 lookup 100
ip route add local default dev ${VETH_SERVER_INNER} scope host table 100

At last, the experiment succeeded.

What is local?

When in doubt, consult the man page: man ip-route, or the online version, ip-route(8). It contains this passage:

       Route types:

               unicast - the route entry describes real paths to the destinations covered by the route prefix.

               unreachable - these destinations are unreachable. Packets are discarded and the ICMP message host unreachable is generated.  The
               local senders get an EHOSTUNREACH error.

               blackhole - these destinations are unreachable. Packets are discarded silently.  The local senders get an EINVAL error.

               prohibit - these destinations are unreachable. Packets are discarded and the ICMP message communication administratively prohib‐
               ited is generated. The local senders get an EACCES error.

               local - the destinations are assigned to this host. The packets are looped back and delivered locally.

               broadcast - the destinations are broadcast addresses. The packets are sent as link broadcasts.

               throw - a special control route used together with policy rules. If such a route is selected, lookup in this table is terminated
               pretending that no route was found. Without policy routing it is equivalent to the absence of the route in the routing table. The
               packets are dropped and the ICMP message net unreachable is generated. The local senders get an ENETUNREACH error.

               nat - a special NAT route. Destinations covered by the prefix are considered to be dummy (or external) addresses which require
               translation to real (or internal) ones before forwarding. The addresses to translate to are selected with the attribute via.
               Warning: Route NAT is no longer supported in Linux 2.6.

               anycast - not implemented the destinations are anycast addresses assigned to this host. They are mainly equivalent to local with
               one difference: such addresses are invalid when used as the source address of any packet.

               multicast - a special type used for multicast routing. It is not present in normal routing tables.

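local is a route type: packets that match such a route are delivered to the host's own protocol stack. Without the keyword, ip route add creates a unicast route, so the marked packets were treated as traffic to be routed onward instead of being delivered to the local server program.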
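The difference is easy to miss when listing routes, because ip route show prints the type keyword at the start of the line and leaves it out for unicast routes. The helper below is a small sketch of that reading rule; the name route_type and the interface name veth0 are made up for illustration.

```shell
# Sketch: report the route type of one line of `ip route show` output.
# route_type is a hypothetical helper; veth0 below is an example device name.
route_type() {
    # Split the line into words; the type keyword, if present, comes first.
    set -- $1
    case "$1" in
        unicast|local|broadcast|multicast|anycast|throw|unreachable|prohibit|blackhole|nat)
            echo "$1"
            ;;
        *)
            # No type keyword printed: the route is an ordinary unicast route.
            echo unicast
            ;;
    esac
}

route_type "default dev veth0 scope host"        # prints: unicast
route_type "local default dev veth0 scope host"  # prints: local
```

Feeding it the lines of ip route show table 100 makes it obvious whether the default route in that table is the local one that TPROXY needs.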

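Summary

One “small” deviation cost me a week, which just shows that my knowledge is not yet deep enough.

I need to read the kernel network stack source code carefully to deepen my understanding of how Linux sends and receives packets.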