终于要将 bpf2bpf & tailcall 的组合落地到项目中了。

不过,当头一棒的是:tail_calls are not allowed when call stack of previous frames is 256 bytes. Too large

幸好学习过 eBPF Talk: tailcall on x86,知道该报错出在哪里:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
// ${KERNEL}/kernel/bpf/verifier.c

/* starting from main bpf function walk all instructions of the function
 * and recursively walk all callees that given function can call.
 * Ignore jump and exit insns.
 * Since recursion is prevented by check_cfg() this algorithm
 * only needs a local stack of MAX_CALL_FRAMES to remember callsites
 */
static int check_max_stack_depth(struct bpf_verifier_env *env)
{
    // ...

process_func:
    /* protect against potential stack overflow that might happen when
     * bpf2bpf calls get combined with tailcalls. Limit the caller's stack
     * depth for such case down to 256 so that the worst case scenario
     * would result in 8k stack size (32 which is tailcall limit * 256 =
     * 8k).
     *
     * To get the idea what might happen, see an example:
     * func1 -> sub rsp, 128
     *  subfunc1 -> sub rsp, 256
     *  tailcall1 -> add rsp, 256
     *   func2 -> sub rsp, 192 (total stack size = 128 + 192 = 320)
     *   subfunc2 -> sub rsp, 64
     *   subfunc22 -> sub rsp, 128
     *   tailcall2 -> add rsp, 128
     *    func3 -> sub rsp, 32 (total stack size 128 + 192 + 64 + 32 = 416)
     *
     * tailcall will unwind the current stack frame but it will not get rid
     * of caller's stack as shown on the example above.
     */
    if (idx && subprog[idx].has_tail_call && depth >= 256) {
        verbose(env,
            "tail_calls are not allowed when call stack of previous frames is %d bytes. Too large\n",
            depth);
        return -EACCES;
    }

  // ...
}

就能够明白,报错是因为 bpf prog 消耗的栈空间不小于 256 字节。

该怎么解决呢?

解决办法不就是减小栈空间的使用量嘛。于是,我便分析了一下栈空间的消耗:

  1. 有个 struct tunnel,消耗 44 字节。
  2. 有个 struct sflow,消耗 48 字节。
  3. 有个 bpf_printk(),消耗 77 字节。

而栈深度的计算方式是向上取整(32 的倍数),超过 224 字节就取 256 字节。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
// ${KERNEL}/kernel/bpf/core.c

#define PROG_NAME_LIST(stack_size) PROG_NAME_ARGS(stack_size),
static u64 (*interpreters_args[])(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5,
                  const struct bpf_insn *insn) = {
EVAL6(PROG_NAME_LIST, 32, 64, 96, 128, 160, 192)
EVAL6(PROG_NAME_LIST, 224, 256, 288, 320, 352, 384)
EVAL4(PROG_NAME_LIST, 416, 448, 480, 512)
};
#undef PROG_NAME_LIST

void bpf_patch_call_args(struct bpf_insn *insn, u32 stack_depth)
{
    stack_depth = max_t(u32, stack_depth, 1);
    insn->off = (s16) insn->imm;
    insn->imm = interpreters_args[(round_up(stack_depth, 32) / 32) - 1] -
        __bpf_call_base_args;
    insn->code = BPF_JMP | BPF_CALL_ARGS;
}

“杀栈凶手” 就是那个 bpf_printk() ,干掉后就不再报这个错了。

Q: bpf_printk() 为什么消耗的是栈空间?

A: 如果 bpf_printk() 的字符串不放在栈里,放在哪里?通过查看 BPF ELF 文件里的汇编、或者 bpftool prog dump xlated,可以确认 bpf_printk() 的字符串就是放在栈上:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
# bpftool prog dump xlated id ${PROG ID}

 1301: (73) *(u8 *)(r10 -12) = r1
 1302: (b7) r1 = 175650864
 1303: (63) *(u32 *)(r10 -16) = r1
 1304: (18) r1 = 0x2578303d72646461
 1306: (7b) *(u64 *)(r10 -24) = r1
 1307: (18) r1 = 0x6420783830257830
 1309: (7b) *(u64 *)(r10 -32) = r1
 1310: (18) r1 = 0x3d72646461732079
 1312: (7b) *(u64 *)(r10 -40) = r1
 1313: (18) r1 = 0x616c7265766f202c
 1315: (7b) *(u64 *)(r10 -48) = r1
 1316: (7b) *(u64 *)(r10 -56) = r7
 1317: (61) r4 = *(u32 *)(r9 +16)
 1318: (61) r3 = *(u32 *)(r9 +12)
 1319: (bf) r1 = r10
;
 1320: (07) r1 += -56
; bpf_printk("acl-test, overlay saddr=0x%08x daddr=0x%08x\n", iph->saddr, iph->daddr);
 1321: (b7) r2 = 45
 1322: (85) call bpf_trace_printk#-62736