From the earlier post "Linux networking: sequence file", we know that many places in Linux use the sequence file mechanism to pass data to user space.

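With eBPF, the sequence file mechanism gains a more flexible implementation: an eBPF program defines the record format on demand, so the output is no longer limited to the fixed formats built into the kernel.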

bpf_iter references:

bpf_iter demo

Demo first: adapt the tcp4 example from the kernel source tree, then run it:

# cat /proc/net/tcp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 21171 1 ffff8c6a5561e300 100 0 0 10 0
   1: 3500007F:0035 00000000:0000 0A 00000000:00000000 00:00000000 00000000   101        0 21025 1 ffff8c6a5561c800 100 0 0 10 5
   2: 8A01A8C0:0016 0C01A8C0:F709 01 00000000:00000000 02:000794F0 00000000     0        0 28502 2 ffff8c6a5561d100 20 4 11 10 -1
   3: 8A01A8C0:0016 0C01A8C0:CB4F 01 00000000:00000000 02:000A1323 00000000     0        0 33929 2 ffff8c6a5561ec00 20 4 25 10 -1
   4: 8A01A8C0:0016 0C01A8C0:CC0B 01 00000000:00000000 02:000AF05D 00000000     0        0 34017 4 ffff8c6a55619200 20 4 29 10 -1
   5: 8A01A8C0:0016 0C01A8C0:CA1F 01 00000000:00000000 02:0008919D 00000000     0        0 30667 2 ffff8c6a5561da00 20 4 23 10 -1
# cat /sys/fs/bpf/itertcp4
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   1: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 21171 2 ffff8c6a5561e300 100 0 0 10 0
   2: 3500007F:0035 00000000:0000 0A 00000000:00000000 00:00000000 00000000   101        0 21025 2 ffff8c6a5561c800 100 0 0 10 5
   3: 8A01A8C0:0016 0C01A8C0:F709 01 00000000:00000000 02:00074E54 00000000     0        0 28502 3 ffff8c6a5561d100 20 4 10 10 -1
   4: 8A01A8C0:0016 0C01A8C0:CB4F 01 00000000:00000000 02:0009CC88 00000000     0        0 33929 3 ffff8c6a5561ec00 20 4 26 10 -1
   5: 8A01A8C0:0016 0C01A8C0:CC0B 01 00000000:00000000 02:000AA9C2 00000000     0        0 34017 4 ffff8c6a55619200 20 4 18 10 -1
   6: 8A01A8C0:0016 0C01A8C0:CA1F 01 00000000:00000000 02:00084B02 00000000     0        0 30667 3 ffff8c6a5561da00 20 4 24 10 -1

As the output shows, bpf_iter can provide the same functionality as /proc/net/tcp.
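To make the two dumps above easier to read, here is a minimal user-space sketch that decodes one ADDR:PORT field into dotted-quad form. The function name decode_tcp4_addr is hypothetical, and the sketch assumes the dump was taken on a little-endian host, where the hex address shows the IPv4 bytes in reverse order.

```c
#include <stdio.h>

/*
 * Decode one "ADDR:PORT" field as printed by /proc/net/tcp,
 * e.g. "8A01A8C0:0016".
 * Assumption: the dump comes from a little-endian host, so the lowest
 * byte of the hex value is the first octet of the IPv4 address.
 * Returns 0 on success, -1 on a parse error or a too-small buffer.
 */
int decode_tcp4_addr(const char *field, char *out, size_t out_len)
{
    unsigned int addr, port;
    int n;

    if (sscanf(field, "%8X:%4X", &addr, &port) != 2)
        return -1;

    n = snprintf(out, out_len, "%u.%u.%u.%u:%u",
                 addr & 0xff, (addr >> 8) & 0xff,
                 (addr >> 16) & 0xff, (addr >> 24) & 0xff, port);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

Under that assumption, the field 8A01A8C0:0016 from the listing decodes to 192.168.1.138:22, and 3500007F:0035 decodes to 127.0.0.53:53.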

For more examples, see the bpf_iter_*.c files under tools/testing/selftests/bpf/progs.

bpf_iter usage

Following the demo, using bpf_iter comes down to two key pieces:

  1. bpf_iter_meta: the metadata of bpf_iter, including the seq_file, the bpf_iter session ID, and the call count.
  2. bpf_seq_printf: write data to the seq_file through the BPF_SEQ_PRINTF macro.

bpf_iter_meta is defined as follows:

struct bpf_iter_meta {
    union {
        struct seq_file *seq;
    };
    u64 session_id;
    u64 seq_num;
};

The ctx of every bpf_iter bpf prog carries a struct bpf_iter_meta *meta.
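A common way to use the metadata is to print a header only for the first record of a session, that is, when seq_num is 0; the kernel's tcp4 selftest follows this pattern. The sketch below models that idea in plain user-space C. The types fake_seq_file and fake_iter_meta and the functions seq_append and dump_record are hypothetical stand-ins, not kernel or libbpf APIs.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct seq_file. */
struct fake_seq_file {
    char buf[256];
    size_t count;          /* bytes already written to buf */
};

/* Hypothetical mirror of struct bpf_iter_meta. */
struct fake_iter_meta {
    struct fake_seq_file *seq;
    uint64_t session_id;
    uint64_t seq_num;      /* 0 for the first record of a session */
};

/* Append a string to the buffer; returns -1 if it does not fit. */
static int seq_append(struct fake_seq_file *seq, const char *s)
{
    size_t len = strlen(s);

    if (seq->count + len >= sizeof(seq->buf))
        return -1;
    memcpy(seq->buf + seq->count, s, len + 1);
    seq->count += len;
    return 0;
}

/* Model of a prog body: header on the first call, then one record. */
int dump_record(struct fake_iter_meta *meta, int value)
{
    char line[64];

    if (meta->seq_num == 0 && seq_append(meta->seq, "  sl value\n"))
        return -1;

    snprintf(line, sizeof(line), "%4llu: %d\n",
             (unsigned long long)meta->seq_num, value);
    return seq_append(meta->seq, line);
}
```

In a real bpf prog the kernel fills in meta before each call, and the writes would go through BPF_SEQ_PRINTF rather than a local buffer.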

The bpf_seq_printf helper is defined as follows:

// ${KERNEL}/kernel/trace/bpf_trace.c

BPF_CALL_5(bpf_seq_printf, struct seq_file *, m, char *, fmt, u32, fmt_size,
       const void *, data, u32, data_len)
{
    int err, num_args;
    u32 *bin_args;

    if (data_len & 7 || data_len > MAX_BPRINTF_VARARGS * 8 ||
        (data_len && !data))
        return -EINVAL;
    num_args = data_len / 8;

    err = bpf_bprintf_prepare(fmt, fmt_size, data, &bin_args, num_args);
    if (err < 0)
        return err;

    seq_bprintf(m, fmt, bin_args);

    bpf_bprintf_cleanup();

    return seq_has_overflowed(m) ? -EOVERFLOW : 0;
}

From the definition, the bpf_seq_printf helper has the following limits:

  1. The data length must be a multiple of 8.
  2. The data length must not exceed 96 (12*8); that is, the bpf_seq_printf helper can format at most 12 arguments.

That said, like the bpf_snprintf helper, the bpf_seq_printf helper also supports formatting IP addresses with %p{i,I}{4,6}.
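The length checks at the top of bpf_seq_printf can be reproduced in user space to see which data_len values are accepted. This is a minimal sketch: the function name seq_printf_num_args and the constant MAX_VARARGS are hypothetical, and the value 12 is taken from the limit described above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match MAX_BPRINTF_VARARGS, per the 12*8 limit above. */
#define MAX_VARARGS 12

/*
 * Apply the same checks as the start of bpf_seq_printf():
 *   - data_len must be a multiple of 8 (one 8-byte slot per argument)
 *   - data_len must not exceed MAX_VARARGS * 8
 *   - a non-zero data_len requires a non-NULL data pointer
 * Returns the number of arguments, or -EINVAL if a check fails.
 */
int seq_printf_num_args(const void *data, uint32_t data_len)
{
    if (data_len & 7 || data_len > MAX_VARARGS * 8 ||
        (data_len && !data))
        return -EINVAL;

    return (int)(data_len / 8);
}
```

For example, three 8-byte slots (data_len 24) yield 3 arguments, while data_len 12 or 104 is rejected with -EINVAL.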

bpf_iter bpf prog

A bpf_iter bpf prog is called from the show function of the seq_file implemented by each concrete bpf_iter. The prog's return value is handled as follows:

// ${KERNEL}/kernel/bpf/bpf_iter.c

        err = seq->op->show(seq, p);
        if (err > 0) {
            bpf_iter_dec_seq_num(seq);
            seq->count = offs;
        } else if (err < 0 || seq_has_overflowed(seq)) {
            seq->count = offs;
            if (offs == 0) {
                if (!err)
                    err = -E2BIG;
                seq->op->stop(seq, p);
                goto done;
            }
            break;
        }

There are 3 cases:

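  1. Less than 0: an error; stop processing and end the session.
  2. Equal to 0: the normal case; continue with the next record.
  3. Greater than 0: skip the current record and continue with the next record.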
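The three cases can be modelled with a small user-space loop. This is a simplified sketch, not the kernel code: it ignores buffer overflow and the seq_num bookkeeping, and the names iter_result, run_iter, and show_even_only are hypothetical.

```c
#include <stddef.h>

struct iter_result {
    size_t emitted;   /* records kept in the output (show returned 0) */
    size_t skipped;   /* records dropped (show returned > 0) */
    int err;          /* first negative return value, or 0 */
};

typedef int (*show_fn)(int record);

/* Walk the records and react to the callback's return value. */
struct iter_result run_iter(const int *records, size_t n, show_fn show)
{
    struct iter_result res = {0, 0, 0};

    for (size_t i = 0; i < n; i++) {
        int err = show(records[i]);

        if (err > 0) {          /* skip this record, keep going */
            res.skipped++;
            continue;
        }
        if (err < 0) {          /* error: stop the whole walk */
            res.err = err;
            break;
        }
        res.emitted++;          /* 0: record accepted */
    }
    return res;
}

/* Hypothetical prog: keep even values, skip odd ones, fail on negatives. */
int show_even_only(int record)
{
    if (record < 0)
        return -1;
    return record % 2 ? 1 : 0;
}
```

Running run_iter over {2, 3, 4, -1, 6} with show_even_only keeps 2 and 4, skips 3, and stops at -1 without ever reaching 6.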

How bpf_iter works

Within the bpf subsystem, bpf_iter has relatively little source code and commit history, so it can be read through quickly:

bpf_iter requires kernel 5.8 or later.

Built on the Linux seq_file mechanism, bpf implements a bpf_iter mechanism that runs a bpf prog; the record format and content are decided by the bpf prog.

bpf_iter implementations

Search the whole kernel source tree for DEFINE_BPF_ITER_FUNC:

kernel/bpf/task_iter.c:619:DEFINE_BPF_ITER_FUNC(task_vma, struct bpf_iter_meta *meta,
kernel/bpf/task_iter.c:368:DEFINE_BPF_ITER_FUNC(task_file, struct bpf_iter_meta *meta,
kernel/bpf/task_iter.c:194:DEFINE_BPF_ITER_FUNC(task, struct bpf_iter_meta *meta, struct task_struct *task)
kernel/bpf/link_iter.c:42:DEFINE_BPF_ITER_FUNC(bpf_link, struct bpf_iter_meta *meta, struct bpf_link *link)
kernel/bpf/map_iter.c:165:DEFINE_BPF_ITER_FUNC(bpf_map_elem, struct bpf_iter_meta *meta,
kernel/bpf/map_iter.c:42:DEFINE_BPF_ITER_FUNC(bpf_map, struct bpf_iter_meta *meta, struct bpf_map *map)
kernel/bpf/cgroup_iter.c:258:DEFINE_BPF_ITER_FUNC(cgroup, struct bpf_iter_meta *meta,
kernel/bpf/prog_iter.c:42:DEFINE_BPF_ITER_FUNC(bpf_prog, struct bpf_iter_meta *meta, struct bpf_prog *prog)
kernel/kallsyms.c:840:DEFINE_BPF_ITER_FUNC(ksym, struct bpf_iter_meta *meta, struct kallsym_iter *ksym)
include/linux/bpf.h:1799:#define DEFINE_BPF_ITER_FUNC(target, args...) \
net/ipv6/route.c:6632:DEFINE_BPF_ITER_FUNC(ipv6_route, struct bpf_iter_meta *meta, struct fib6_info *rt)
net/unix/af_unix.c:3652:DEFINE_BPF_ITER_FUNC(unix, struct bpf_iter_meta *meta,
net/ipv4/udp.c:3274:DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
net/ipv4/tcp_ipv4.c:3250:DEFINE_BPF_ITER_FUNC(tcp, struct bpf_iter_meta *meta,
net/netlink/af_netlink.c:2716:DEFINE_BPF_ITER_FUNC(netlink, struct bpf_iter_meta *meta, struct netlink_sock *sk)
net/core/sock_map.c:702:DEFINE_BPF_ITER_FUNC(sockmap, struct bpf_iter_meta *meta,
net/core/bpf_sk_storage.c:830:DEFINE_BPF_ITER_FUNC(bpf_sk_storage_map, struct bpf_iter_meta *meta,

Excluding the DEFINE_BPF_ITER_FUNC macro definition itself, 16 places support the bpf_iter mechanism.

  1. task: "bpf: Add task and task/file iterator targets", since kernel 5.8
  2. task_file: "bpf: Add task and task/file iterator targets", since kernel 5.8
  3. task_vma: "bpf: Introduce task_vma bpf_iter", since kernel 5.12
  4. bpf_link: "bpf: Add bpf_link iterator", since kernel 5.19
  5. bpf_map: "bpf: Add bpf_map iterator", since kernel 5.8
  6. bpf_map_elem: "bpf: Implement bpf iterator for map elements", since kernel 5.9
  7. cgroup: "bpf: Introduce cgroup iter", since kernel 6.1
  8. bpf_prog: "bpf: Add bpf_prog iterator", since kernel 5.9
  9. ksym: "bpf: add a ksym BPF iterator", since kernel 6.0
  10. ipv6_route: "net: bpf: Add netlink and ipv6_route bpf_iter targets", since kernel 5.8
  11. UNIX domain socket: "bpf: af_unix: Implement BPF iterator for UNIX domain socket.", since kernel 5.15
  12. udp (IPv4, IPv6): "net: bpf: Implement bpf iterator for udp", since kernel 5.9
  13. tcp (IPv4, IPv6): "net: bpf: Implement bpf iterator for tcp", since kernel 5.9
  14. netlink: "net: bpf: Add netlink and ipv6_route bpf_iter targets", since kernel 5.8
  15. sockmap/sockhash: "net: Allow iterating sockmap and sockhash", since kernel 5.10
  16. bpf_sk_storage: "bpf: Implement bpf iterator for sock local storage map", since kernel 5.9

Summary

bpf_iter is a flexible sequence mechanism in the bpf subsystem, built on the seq_file mechanism; its record format and content are both decided by a user-written bpf prog.