eBPF Talk: trampoline on x86 (warning: assembly ahead)

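This post covers how the trampoline is implemented on x86, going through both the principle and the implementation in detail.

Like freplace, the trampoline is another application of poking the prologue.

TL;DR: like freplace, the trampoline replaces the first nop instruction in the prologue with a call instruction. After the call, execution can either return and continue with the original function, or not return to it at all.

When I first came across the trampoline, I only knew that the trampoline-based fentry/fexit were better than kprobe/kretprobe; I did not know why.

eBPF Talk: trampoline, better than kprobe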

general trampoline

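The trampoline is a common instruction-manipulation technique. It is not specific to Linux; it applies equally to Windows, macOS, and other systems.

A trampoline implemented with jmp looks like this:

[Figure: diagram of a jmp-based trampoline]

The trampoline is the underlying technique behind many advanced tools, such as debuggers, dynamic tracing, and dynamic injection.

Trampoline learning resources:

YouTube: C++ Internal Trampoline Hook Tutorial - OpenGL Hook (highly recommended; it makes everything click)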
x86 API Hooking Demystified

bpf trampoline on x86

Because the Linux kernel reserves a 5-byte nop instruction at the very start of every traceable function, the bpf trampoline is much simpler than a general trampoline: the function_gate part is not needed at all.
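To make the "5 bytes" concrete, here is a minimal sketch of how a 5-byte x86 `call rel32` is encoded: one opcode byte (0xE8) followed by a 32-bit little-endian displacement measured from the end of the instruction. The helper name `emit_call_rel32` and the macros are made up for illustration; this is not the kernel's code, only the arithmetic behind it.

```c
#include <stdint.h>

#define X86_CALL_OPCODE 0xE8 /* opcode of `call rel32` */
#define X86_PATCH_SIZE  5    /* 1 opcode byte + 4 displacement bytes */

/*
 * Hypothetical helper (not a kernel function): build the 5-byte `call rel32`
 * that, when placed at address `ip`, transfers control to `target`.
 * Returns 0 on success, -1 if `target` is out of rel32 range.
 */
static int emit_call_rel32(uint8_t insn[X86_PATCH_SIZE], uint64_t ip, uint64_t target)
{
    /* The displacement is relative to the end of the call instruction. */
    int64_t offset = (int64_t)(target - (ip + X86_PATCH_SIZE));

    if (offset < INT32_MIN || offset > INT32_MAX)
        return -1;

    uint32_t rel = (uint32_t)(int32_t)offset;

    insn[0] = X86_CALL_OPCODE;
    for (int i = 0; i < 4; i++)
        insn[1 + i] = (uint8_t)(rel >> (8 * i)); /* little-endian byte order */
    return 0;
}
```

For example, with `ip = 0x1000` and `target = 0x2000` the displacement is 0x2000 - 0x1005 = 0xFFB, so the encoded bytes are `e8 fb 0f 00 00`.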

Currently, the bpf features built on the trampoline technique are fentry, fexit, and fmod_ret.

In the kernel, the bpf trampoline on x86 is implemented as follows:

// ${KERNEL}/arch/x86/net/bpf_jit_comp.c

/* Example:
 * ...
 *
 * The assembly code when eth_type_trans is called from trampoline:
 *
 * push rbp
 * mov rbp, rsp
 * sub rsp, 24                     // space for skb, dev, return value
 * push rbx                        // temp regs to pass start time
 * mov qword ptr [rbp - 24], rdi   // save skb pointer to stack
 * mov qword ptr [rbp - 16], rsi   // save dev pointer to stack
 * call __bpf_prog_enter           // rcu_read_lock and preempt_disable
 * mov rbx, rax                    // remember start time if bpf stats are enabled
 * lea rdi, [rbp - 24]             // R1==ctx of bpf prog
 * call addr_of_jited_FENTRY_prog  // bpf prog can access skb and dev
 * movabsq rdi, 64bit_addr_of_struct_bpf_prog  // unused if bpf stats are off
 * mov rsi, rbx                    // prog start time
 * call __bpf_prog_exit            // rcu_read_unlock, preempt_enable and stats math
 * mov rdi, qword ptr [rbp - 24]   // restore skb pointer from stack
 * mov rsi, qword ptr [rbp - 16]   // restore dev pointer from stack
 * call eth_type_trans+5           // execute body of eth_type_trans
 * mov qword ptr [rbp - 8], rax    // save return value
 * call __bpf_prog_enter           // rcu_read_lock and preempt_disable
 * mov rbx, rax                    // remember start time in bpf stats are enabled
 * lea rdi, [rbp - 24]             // R1==ctx of bpf prog
 * call addr_of_jited_FEXIT_prog   // bpf prog can access skb, dev, return value
 * movabsq rdi, 64bit_addr_of_struct_bpf_prog  // unused if bpf stats are off
 * mov rsi, rbx                    // prog start time
 * call __bpf_prog_exit            // rcu_read_unlock, preempt_enable and stats math
 * mov rax, qword ptr [rbp - 8]    // restore eth_type_trans's return value
 * pop rbx
 * leave
 * add rsp, 8                      // skip eth_type_trans's frame
 * ret                             // return to its caller
 */
int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *image_end,
                                const struct btf_func_model *m, u32 flags,
                                struct bpf_tramp_links *tlinks,
                                void *func_addr)
{
    // ...
}

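The example in the comment already shows the pieces a bpf trampoline is built from:

call __bpf_prog_enter.
call the address 5 bytes past the entry of the target function.
call the fentry bpf progs.
call the target function.
call the fmod_ret bpf progs.
call the fexit bpf progs.
call __bpf_prog_exit.
Skip execution of the target function when needed.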
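The control flow can be modelled in plain C with function pointers. This is only a sketch, assuming the ordering I understand the x86 trampoline to use: fentry progs first, then fmod_ret progs (a non-zero result skips the target function and becomes the return value), then the target function, then fexit progs. All type and function names here (`tramp_ctx`, `tramp_model`, `run_trampoline`, the `sample_*` helpers) are hypothetical. In the kernel comment above, each prog call is additionally bracketed by __bpf_prog_enter and __bpf_prog_exit, which this model leaves out.

```c
#include <stddef.h>

/* Hypothetical model of what a tracing prog sees: saved args, then return value. */
struct tramp_ctx {
    long args[2];
    long ret;
};

typedef long (*target_fn)(long a, long b);
typedef void (*trace_prog)(struct tramp_ctx *ctx);   /* fentry / fexit */
typedef long (*mod_ret_prog)(struct tramp_ctx *ctx); /* fmod_ret */

struct tramp_model {
    trace_prog   fentry;   /* may be NULL */
    mod_ret_prog fmod_ret; /* may be NULL */
    target_fn    orig;     /* body of the traced function */
    trace_prog   fexit;    /* may be NULL */
};

static long run_trampoline(const struct tramp_model *t, long a, long b)
{
    /* Like the trampoline, save the arguments so progs can read them. */
    struct tramp_ctx ctx = { .args = { a, b }, .ret = 0 };
    int skip_orig = 0;

    if (t->fentry)
        t->fentry(&ctx);

    if (t->fmod_ret) {
        long r = t->fmod_ret(&ctx);
        if (r != 0) {          /* non-zero: override the result, skip the target */
            ctx.ret = r;
            skip_orig = 1;
        }
    }

    if (!skip_orig)            /* restore args and run the original body */
        ctx.ret = t->orig(ctx.args[0], ctx.args[1]);

    if (t->fexit)
        t->fexit(&ctx);        /* fexit sees both args and return value */

    return ctx.ret;
}

/* Sample target and progs, used only to exercise the model. */
static int target_calls, fentry_calls, fexit_calls;
static long last_ret_seen_by_fexit;

static long sample_target(long a, long b) { target_calls++; return a + b; }
static void sample_fentry(struct tramp_ctx *ctx) { (void)ctx; fentry_calls++; }
static long sample_fmod_ret(struct tramp_ctx *ctx) { return ctx->args[0] < 0 ? -22 : 0; }
static void sample_fexit(struct tramp_ctx *ctx)
{
    fexit_calls++;
    last_ret_seen_by_fexit = ctx->ret;
}
```

Calling `run_trampoline()` with a negative first argument makes `sample_fmod_ret` return -22, so `sample_target` is never entered while `sample_fexit` still runs and observes -22.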

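How fentry works:

[Figure: how fentry works]

How fexit works:

[Figure: how fexit works]

That covers the bpf trampoline at the instruction level. Next, let's look at when the bpf trampoline actually uses poke.

Demo source code: fentry_fexit.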

// ${cilium/ebpf}/link/tracing.go

func AttachTracing(opts TracingOptions) (Link, error) {
    // ...

    return attachBTFID(opts.Program)
}

func attachBTFID(program *ebpf.Program) (Link, error) {
    // ...

    fd, err := sys.RawTracepointOpen(&sys.RawTracepointOpenAttr{
        ProgFd: uint32(program.FD()),
    })
    // ...

    return &tracing{RawLink: RawLink{fd: fd}}, nil
}

// ${cilium/ebpf}/internal/sys/types.go

func RawTracepointOpen(attr *RawTracepointOpenAttr) (*FD, error) {
    fd, err := BPF(BPF_RAW_TRACEPOINT_OPEN, unsafe.Pointer(attr), unsafe.Sizeof(*attr))
    if err != nil {
        return nil, err
    }
    return NewFD(int(fd))
}

As the snippets above show, attaching ends up invoking the BPF_RAW_TRACEPOINT_OPEN command of the bpf() system call.

Now let's see what happens in the kernel:

SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)      // ${KERNEL}/kernel/bpf/syscall.c
|-->__sys_bpf()
    |-->bpf_raw_tracepoint_open()
        |-->bpf_raw_tp_link_attach()
            |-->bpf_tracing_prog_attach()
                |-->bpf_link_init()
                |-->tr = prog->aux->dst_trampoline;
                |-->bpf_link_prime()
                |-->bpf_trampoline_link_prog()                                          // ${KERNEL}/kernel/bpf/trampoline.c
                |   |-->__bpf_trampoline_link_prog()
                |       |-->bpf_trampoline_update()
                |           |-->arch_prepare_bpf_trampoline()                           // ${KERNEL}/arch/x86/net/bpf_jit_comp.c
                |           |-->register_fentry()
                |               |-->void *ip = tr->func.addr;
                |               |-->bpf_arch_text_poke(ip, BPF_MOD_CALL, NULL, new_addr);
                |-->bpf_link_settle()

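Once bpf_arch_text_poke() shows up, the main logic of the call chain above becomes clear:

Take the target trampoline object that was already prepared in the bpf prog.
Call arch_prepare_bpf_trampoline() to generate the bpf trampoline program.
Call bpf_arch_text_poke() to live patch the first nop instruction at the entry of the target function into a call instruction that calls the entry address of the newly generated trampoline.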
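The last step can be pictured as a compare-then-write on 5 bytes of text: the poke only goes ahead if the bytes currently at the patch site are the ones it expects (a nop when the old address is NULL). Below is a user-space sketch of that idea on an ordinary byte buffer. The function `model_text_poke` and its return codes are invented for illustration; the real kernel routine also has to patch live code safely, which this model does not attempt. The nop bytes are the usual x86-64 5-byte nop encoding.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PATCH_SIZE 5

/* The common x86-64 5-byte nop: nopl 0x0(%rax,%rax,1). */
static const uint8_t nop5[PATCH_SIZE] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/*
 * Hypothetical model of a text poke on a plain buffer.
 * Returns  0 if the 5 bytes at `off` matched `expected` and were replaced,
 *         -1 if the patch site does not fit in the buffer,
 *         -2 if the current bytes are not what the caller expected.
 */
static int model_text_poke(uint8_t *text, size_t text_len, size_t off,
                           const uint8_t expected[PATCH_SIZE],
                           const uint8_t replacement[PATCH_SIZE])
{
    if (off > text_len || text_len - off < PATCH_SIZE)
        return -1;

    /* Refuse to patch if someone else already changed the site. */
    if (memcmp(text + off, expected, PATCH_SIZE) != 0)
        return -2;

    memcpy(text + off, replacement, PATCH_SIZE);
    return 0;
}
```

Attaching corresponds to a poke from `nop5` to a call; in this model, detaching would be the reverse poke, with the call bytes as `expected` and `nop5` as the replacement.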

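Summary

Once you understand how the bpf trampoline works, features such as fentry, fexit, and fmod_ret are easy to master.

One thing I am still curious about, though: when using fexit, why does adding an add $0x8,%rsp instruction before ret skip execution of the original function?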
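One way to approach that question is to model the stack by hand. The sketch below assumes the sequence from the kernel comment: the caller's call pushes a return address, the patched call at the function entry pushes a second one (function entry + 5), and the trampoline's prologue and epilogue are balanced. The addresses and names (`RET_TO_CALLER`, `FOO_PLUS_5`, `return_target`) are invented. In the fexit case the trampoline has already run the body itself via call eth_type_trans+5, so the second return address is stale; add rsp, 8 discards it and ret lands in the original caller.

```c
#include <stddef.h>
#include <stdint.h>

#define SLOTS 16 /* each slot stands for 8 bytes of stack */

/* Made-up addresses: where the caller resumes, and the traced function's body. */
enum { RET_TO_CALLER = 0x1000, FOO_PLUS_5 = 0x2005 };

struct stack_model {
    uint64_t slot[SLOTS];
    size_t rsp; /* index of the top of the stack; the stack grows downward */
    size_t rbp;
};

static void push(struct stack_model *m, uint64_t v) { m->slot[--m->rsp] = v; }
static uint64_t pop(struct stack_model *m) { return m->slot[m->rsp++]; }

/* Returns the address the trampoline's final `ret` jumps to. */
static uint64_t return_target(int skip_frame)
{
    struct stack_model m = { .rsp = SLOTS, .rbp = 0 };

    push(&m, RET_TO_CALLER);  /* caller:  call foo                     */
    push(&m, FOO_PLUS_5);     /* foo+0:   call trampoline (patched nop) */

    push(&m, m.rbp);          /* push rbp     */
    m.rbp = m.rsp;            /* mov rbp, rsp */
    m.rsp -= 3;               /* sub rsp, 24  */
    push(&m, 0);              /* push rbx     */

    /* ... progs run here; every call inside returns with rsp unchanged ... */

    (void)pop(&m);            /* pop rbx */
    m.rsp = m.rbp;            /* leave: mov rsp, rbp ...   */
    m.rbp = (size_t)pop(&m);  /*        ... then pop rbp   */

    if (skip_frame)
        m.rsp += 1;           /* add rsp, 8: drop the foo+5 return address */

    return pop(&m);           /* ret */
}
```

Without the skip, `return_target(0)` yields `FOO_PLUS_5`, which matches a trampoline that returns into the function body; with it, `return_target(1)` yields `RET_TO_CALLER`, so the body is not entered a second time.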
