Just to join the fun, let's statically compare the runtime performance of four bpf dynamic tracing technologies: tracepoint, raw_tp/tp_btf, kprobe, and fentry/fexit.

What? That expert compared tracepoint and kprobe performance by stepping through the kernel with a debugger. Sheer overkill, and hardly something an ordinary person can reproduce.

So, now that we are in the bpf era, is there a simple, practical way to compare their runtime performance using bpf itself?

btrace is a modern kernel-function dynamic tracing tool for the bpf era. It can trace a tracee's function call stack and last branch record (LBR) stack, and it can also disassemble a tracee's machine code for analysis.

This article uses btrace to statically compare the runtime performance of these four technologies.

P.S. btrace source code: github.com/leonhwangprojects/btrace

Experiment Environment

The test box is a small Intel machine running Ubuntu 24.10; its CPU supports the LBR feature (a 32-deep LBR stack, as the dmesg output below shows).

$ dmesg | grep -i lbr
[    0.101408] Performance Events: XSAVE Architectural LBR, PEBS fmt4+-baseline,  AnyThread deprecated, Alderlake Hybrid events, 32-deep LBR, full-width counters, Intel PMU driver.

$ uname -a
Linux bpf-dev 6.11.0-8-generic #8-Ubuntu SMP PREEMPT_DYNAMIC Mon Sep 16 13:41:20 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.10
Release:        24.10
Codename:       oracular

$ clang --version
Ubuntu clang version 18.1.8 (11)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ go version
go version go1.24.0 linux/amd64

Analyzing the function call stack, LBR stack, and disassembled machine code before a tracepoint bpf program is called

Recommended tracepoint reading:

Take the netlink/netlink_extack tracepoint as an example:

$ cat /sys/kernel/debug/tracing/events/netlink/netlink_extack/format
name: netlink_extack
ID: 1680
format:
    field:unsigned short common_type;   offset:0;   size:2; signed:0;
    field:unsigned char common_flags;   offset:2;   size:1; signed:0;
    field:unsigned char common_preempt_count;   offset:3;   size:1; signed:0;
    field:int common_pid;   offset:4;   size:4; signed:1;

    field:__data_loc char[] msg;    offset:8;   size:4; signed:0;

print fmt: "msg=%s", __get_str(msg)

Its source code:

// net/netlink/af_netlink.c

void do_trace_netlink_extack(const char *msg)
{
    trace_netlink_extack(msg);
}
EXPORT_SYMBOL(do_trace_netlink_extack);
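
For reference, the tp__netlink_extack bpf program that btrace traces below might look roughly like the following minimal sketch. It is an assumption, not the exact program from the referenced post: the struct name netlink_extack_error_ctx is taken from the btrace output, and its layout follows the format file above (the common fields occupy the first 8 bytes; msg is a __data_loc field whose low 16 bits hold the string's offset inside the event record).

// Minimal sketch of a tracepoint bpf program for netlink/netlink_extack.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct netlink_extack_error_ctx {
    __u64 unused; /* common_type, common_flags, common_preempt_count, common_pid */
    __u32 msg;    /* __data_loc char[] msg: low 16 bits = offset, high 16 bits = length */
};

SEC("tracepoint/netlink/netlink_extack")
int tp__netlink_extack(struct netlink_extack_error_ctx *ctx)
{
    char msg[64];
    __u32 off = ctx->msg & 0xFFFF; /* offset of the string within the event record */

    bpf_probe_read_kernel_str(msg, sizeof(msg), (void *)ctx + off);
    bpf_printk("netlink extack: %s", msg);
    return 0;
}

char _license[] SEC("license") = "GPL";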

The function call stack before the tracepoint bpf program is called

Using the example from eBPF Talk: trace tracepoint 程序, let's use btrace to look at the function call stack that leads to the bpf program inside the tracepoint:

$ sudo ./btrace -p 183 --output-stack --limit-events 1 -v
...
2025/03/16 21:34:53 btrace is running..
tp__netlink_extack[bpf] args=((struct netlink_extack_error_ctx *)ctx=0xffffc0c73f208470) retval=(int)0 cpu=8 process=(14792:fentry_fexit-tr)
Func stack:
  ...
  0xffffffffc03836a1:tp__netlink_extack+0x5                             ; tracepoint.c:27
  0xffffffffbbbb7873:perf_trace_run_bpf_submit+0x43                     ; kernel/events/core.c:10315
  0xffffffffbc7952a4:perf_trace_netlink_extack+0x104                    ; include/trace/events/netlink.h:9
  0xffffffffbc794619:do_trace_netlink_extack+0x49                       ; include/trace/events/netlink.h:9
  0xffffffffbc787308:__tcf_qdisc_find.part.0+0x1a8                      ; net/sched/cls_api.c:1214
  0xffffffffbc78c95a:tc_new_tfilter+0x16a                               ; net/sched/cls_api.c:1195
  0xffffffffbc71a400:rtnetlink_rcv_msg+0x380                            ; net/core/rtnetlink.c:6638
  0xffffffffbc79a872:netlink_rcv_skb+0x52                               ; net/netlink/af_netlink.c:2550
  0xffffffffbc7180c5:rtnetlink_rcv+0x15                                 ; net/core/rtnetlink.c:6666
  0xffffffffbc799fa5:netlink_unicast+0x245                              ; net/netlink/af_netlink.c:1331
  0xffffffffbc79a314:netlink_sendmsg+0x214                              ; net/netlink/af_netlink.c:1901
  ...

As you can see, there are 4 levels of function calls from __tcf_qdisc_find to the tp__netlink_extack bpf program.

The LBR stack before the tracepoint bpf program is called

$ sudo ./btrace -p 321 --mode entry --output-lbr --limit-events 1 -v
...
2025/03/16 23:36:20 btrace is running..
tp__netlink_extack[bpf] args=((struct netlink_extack_error_ctx *)ctx=0xffffc0c73ef10ec0) cpu=2 process=(23656:fentry_fexit-tr)
LBR stack:
[#31] 0xffffffffbc9b4937:strlen+0x17                               (lib/string.c:402)                     -> 0xffffffffbc9b4930:strlen+0x10                               (lib/string.c:402)
      0xffffffffbc9b4937:strlen+0x17                               (lib/string.c:402)                     -> 0xffffffffbc9b4930:strlen+0x10                               (lib/string.c:402)
      0xffffffffbc9b4937:strlen+0x17                               (lib/string.c:402)                     -> 0xffffffffbc9b4930:strlen+0x10                               (lib/string.c:402)
      0xffffffffbc9b4937:strlen+0x17                               (lib/string.c:402)                     -> 0xffffffffbc9b4930:strlen+0x10                               (lib/string.c:402)
      0xffffffffbc9b4937:strlen+0x17                               (lib/string.c:402)                     -> 0xffffffffbc9b4930:strlen+0x10                               (lib/string.c:402)
      0xffffffffbc9b493e:strlen+0x1e                               (lib/string.c:404)                     -> 0xffffffffbc7951e4:perf_trace_netlink_extack+0x44            (include/trace/events/netlink.h:9)
      0xffffffffbc795221:perf_trace_netlink_extack+0x81            (include/trace/events/netlink.h:9)     -> 0xffffffffbbacbc60:perf_trace_buf_alloc+0x0                  (kernel/trace/trace_event_perf.c:393)
      0xffffffffbbacbc81:perf_trace_buf_alloc+0x21                 (kernel/trace/trace_event_perf.c:404)  -> 0xffffffffbbba7350:perf_swevent_get_recursion_context+0x0    (kernel/events/core.c:10000)
      0xffffffffbbba73a5:perf_swevent_get_recursion_context+0x55   (kernel/events/core.c:10002)           -> 0xffffffffbbacbc86:perf_trace_buf_alloc+0x26                 (kernel/trace/trace_event_perf.c:404)
      0xffffffffbbacbceb:perf_trace_buf_alloc+0x8b                 (kernel/trace/trace_event_perf.c:415)  -> 0xffffffffbc795226:perf_trace_netlink_extack+0x86            (include/trace/events/netlink.h:9)
      0xffffffffbc795276:perf_trace_netlink_extack+0xd6            (include/trace/events/netlink.h:9)     -> 0xffffffffbcac4fe0:__memcpy+0x0                              (arch/x86/lib/memcpy_64.S:34)
      0xffffffffbcac4fea:__memcpy+0xa                              (arch/x86/lib/memcpy_64.S:39)          -> 0xffffffffbc79527b:perf_trace_netlink_extack+0xdb            (include/trace/events/netlink.h:9)
      0xffffffffbc79529f:perf_trace_netlink_extack+0xff            (include/trace/events/netlink.h:9)     -> 0xffffffffbbbb7830:perf_trace_run_bpf_submit+0x0             (kernel/events/core.c:10312)
      0xffffffffbbbb786e:perf_trace_run_bpf_submit+0x3e            (kernel/events/core.c:10315)           -> 0xffffffffbbaf0ad0:trace_call_bpf+0x0                        (kernel/trace/bpf_trace.c:112)
      0xffffffffbbaf0b0a:trace_call_bpf+0x3a                       (kernel/trace/bpf_trace.c:117)         -> 0xffffffffbb9d3a20:__rcu_read_lock+0x0                       (kernel/rcu/tree_plugin.h:411)
      0xffffffffbb9d3a45:__rcu_read_lock+0x25                      (kernel/rcu/tree_plugin.h:418)         -> 0xffffffffbbaf0b0f:trace_call_bpf+0x3f                       (kernel/trace/bpf_trace.c:147)
      0xffffffffbbaf0b38:trace_call_bpf+0x68                       (include/linux/bpf.h:2099)             -> 0xffffffffbb952360:migrate_disable+0x0                       (kernel/sched/core.c:2232)
      0xffffffffbb952377:migrate_disable+0x17                      (kernel/sched/core.c:2235)             -> 0xffffffffbb95238c:migrate_disable+0x2c                      (kernel/sched/core.c:2237)
      0xffffffffbb9523bb:migrate_disable+0x5b                      (include/linux/preempt.h:480)          -> 0xffffffffbbaf0b3d:trace_call_bpf+0x6d                       (include/linux/bpf.h:2099)
      0xffffffffbbaf0b84:trace_call_bpf+0xb4                       (include/linux/bpf.h:1243)             -> 0xffffffffc0383688:tp__netlink_extack+0x0                    (tracepoint.c:27)
      0xffffffffc0383688:tp__netlink_extack+0x0                    (tracepoint.c:27)                      -> 0xffffffffc03a1dc0:bpf_trampoline_1378684502017+0x0
      ...

Hmm, the LBR stack from __tcf_qdisc_find to the tp__netlink_extack bpf program cannot be recorded completely. There are simply too many branches along the way; the LBR stack is only 32 entries deep on this CPU, so it cannot hold them all.

The disassembled machine code before the tracepoint bpf program is called

btrace can disassemble both kernel functions and bpf programs, so let's inspect the machine code along the call path to the tp__netlink_extack bpf program:

$ sudo ./btrace -d -k __tcf_qdisc_find.part.0
2025/03/16 21:34:08 Disassembling __tcf_qdisc_find.part.0 at 0xffffffffbc787160 (464 bytes) ..
; __tcf_qdisc_find.part.0+0x0 net/sched/cls_api.c:1187
0xffffffffbc787160: 0f 1f 44 00 00          nopl    (%rax, %rax)
...
; __tcf_qdisc_find.part.0+0xb5 net/sched/cls_api.c:1212
0xffffffffbc787215: 66 31 f6                xorw    %si, %si
0xffffffffbc787218: 48 89 c7                movq    %rax, %rdi
0xffffffffbc78721b: e8 40 ed ff ff          callq   0xffffffffbc785f60  ; qdisc_lookup_rcu+0x0 net/sched/sch_api.c:319
0xffffffffbc787220: 49 89 04 24             movq    %rax, (%r12)
0xffffffffbc787224: 48 89 c3                movq    %rax, %rbx
; __tcf_qdisc_find.part.0+0xc7 net/sched/cls_api.c:1213
0xffffffffbc787227: 48 85 c0                testq   %rax, %rax
0xffffffffbc78722a: 0f 84 cc 00 00 00       je      0xffffffffbc7872fc  ; __tcf_qdisc_find.part.0+0x19c net/sched/cls_api.c:1214
...
; __tcf_qdisc_find.part.0+0x19c net/sched/cls_api.c:1214
0xffffffffbc7872fc: 48 c7 c7 10 de f7 bc    movq    $-0x430821f0, %rdi
0xffffffffbc787303: e8 c8 d2 00 00          callq   0xffffffffbc7945d0  ; do_trace_netlink_extack+0x0 net/netlink/af_netlink.c:155
0xffffffffbc787308: 4d 85 ed                testq   %r13, %r13
0xffffffffbc78730b: 0f 84 74 ff ff ff       je      0xffffffffbc787285  ; __tcf_qdisc_find.part.0+0x125 net/sched/cls_api.c:1215
0xffffffffbc787311: 49 c7 45 00 10 de f7 bc movq    $-0x430821f0, (%r13)
0xffffffffbc787319: e9 67 ff ff ff          jmp     0xffffffffbc787285  ; __tcf_qdisc_find.part.0+0x125 net/sched/cls_api.c:1215
0xffffffffbc78731e: e8 6d 28 25 ff          callq   0xffffffffbb9d9b90  ; __rcu_read_unlock+0x0 kernel/rcu/tree_plugin.h:429
; __tcf_qdisc_find.part.0+0x1c3 net/sched/cls_api.c:1204
0xffffffffbc787323: bb ed ff ff ff          movl    $0xffffffed, %ebx
0xffffffffbc787328: e9 c5 fe ff ff          jmp 0xffffffffbc7871f2      ; __tcf_qdisc_find.part.0+0x92 net/sched/cls_api.c:1260
0xffffffffbc78732d: 0f 1f 00                nopl    (%rax)

$ sudo ./btrace -d -k do_trace_netlink_extack
2025/03/16 21:47:23 Disassembling do_trace_netlink_extack at 0xffffffffbc7945d0 (128 bytes) ..
; do_trace_netlink_extack+0x0 net/netlink/af_netlink.c:155
0xffffffffbc7945d0: 0f 1f 44 00 00          nopl    (%rax, %rax)
0xffffffffbc7945d5: 66 90                   nop
...
0xffffffffbc794614: e8 77 fb ff ff          callq   0xffffffffbc794190  ; __traceiter_netlink_extack+0x0 include/trace/events/netlink.h:9
0xffffffffbc794619: 65 ff 0d a8 16 8a 43    decl    %gs:0x438a16a8(%rip)
0xffffffffbc794620: 74 17                   je      0xffffffffbc794639  ; do_trace_netlink_extack+0x69 include/trace/events/netlink.h:9
; do_trace_netlink_extack+0x52 net/netlink/af_netlink.c:157
0xffffffffbc794622: 5d                      popq    %rbp
0xffffffffbc794623: 31 c0                   xorl    %eax, %eax
0xffffffffbc794625: 31 f6                   xorl    %esi, %esi
0xffffffffbc794627: 31 ff                   xorl    %edi, %edi
0xffffffffbc794629: c3                      retq
...

$ sudo ./btrace -d -k perf_trace_netlink_extack
2025/03/16 21:51:42 Disassembling perf_trace_netlink_extack at 0xffffffffbc7951a0 (368 bytes) ..
; perf_trace_netlink_extack+0x0 include/trace/events/netlink.h:9
0xffffffffbc7951a0: 55                      pushq   %rbp
...
0xffffffffbc795298: 4c 8b 4d c8             movq    -0x38(%rbp), %r9
0xffffffffbc79529c: 8b 55 c4                movl    -0x3c(%rbp), %edx
0xffffffffbc79529f: e8 8c 25 42 ff          callq   0xffffffffbbbb7830  ; perf_trace_run_bpf_submit+0x0 kernel/events/core.c:10312
0xffffffffbc7952a4: 59                      popq    %rcx
...

$ sudo ./btrace -d -k perf_trace_run_bpf_submit
2025/03/16 21:54:09 Disassembling perf_trace_run_bpf_submit at 0xffffffffbbbb7830 (224 bytes) ..
; perf_trace_run_bpf_submit+0x0 kernel/events/core.c:10312
0xffffffffbbbb7830: 0f 1f 44 00 00          nopl    (%rax, %rax)
...
; perf_trace_run_bpf_submit+0x38 kernel/events/core.c:10315
0xffffffffbbbb7868: 48 89 fe                movq    %rdi, %rsi
0xffffffffbbbb786b: 48 89 cf                movq    %rcx, %rdi
0xffffffffbbbb786e: e8 5d 92 f3 ff          callq   0xffffffffbbaf0ad0  ; trace_call_bpf+0x0 kernel/trace/bpf_trace.c:112
0xffffffffbbbb7873: 85 c0                   testl   %eax, %eax

$ echo ...

Analyzing the function call stack, LBR stack, and disassembled machine code before a raw_tp/tp_btf bpf program is called

Recommended raw_tp/tp_btf reading:

raw_tp and tp_btf are dynamic tracing technologies that claim better performance than tracepoint. The difference between the two is that tp_btf uses BTF information; at runtime they are identical.

Take tp_btf on netlink/netlink_extack as an example.
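
For reference, a tp_btf program for this tracepoint might look roughly like the following minimal sketch (an assumption, not the program actually used below). Thanks to BTF, the program receives the tracepoint's typed argument directly instead of a perf event record:

// Minimal sketch of a tp_btf program for netlink_extack.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("tp_btf/netlink_extack")
int BPF_PROG(tp__netlink_extack, const char *msg)
{
    char buf[64];

    /* msg is the raw argument passed to trace_netlink_extack() */
    bpf_probe_read_kernel_str(buf, sizeof(buf), msg);
    bpf_printk("netlink extack: %s", buf);
    return 0;
}

char _license[] SEC("license") = "GPL";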

The function call stack before the raw_tp/tp_btf bpf program is called

$ sudo ./btrace -p 221 --output-stack --limit-events 1 -v
...
2025/03/16 22:26:04 btrace is running..
tp__netlink_extack[bpf] args=((unsigned long long *)ctx=0xffffa0c74a59b7f0(0xffffffffbcf7de10)) retval=(int)0 cpu=4 process=(18217:fentry_fexit-tr)
Func stack:
  ...
  0xffffffffc0383655:tp__netlink_extack+0x5                             ; tp_btf.c:12
  0xffffffffbc7946e9:__bpf_trace_netlink_extack+0x9                     ; include/trace/events/netlink.h:9
  0xffffffffbc794619:do_trace_netlink_extack+0x49                       ; include/trace/events/netlink.h:9
  0xffffffffbc787308:__tcf_qdisc_find.part.0+0x1a8                      ; net/sched/cls_api.c:1214
  0xffffffffbc78c95a:tc_new_tfilter+0x16a                               ; net/sched/cls_api.c:1195
  0xffffffffbc71a400:rtnetlink_rcv_msg+0x380                            ; net/core/rtnetlink.c:6638
  0xffffffffbc79a872:netlink_rcv_skb+0x52                               ; net/netlink/af_netlink.c:2550
  0xffffffffbc7180c5:rtnetlink_rcv+0x15                                 ; net/core/rtnetlink.c:6666
  0xffffffffbc799fa5:netlink_unicast+0x245                              ; net/netlink/af_netlink.c:1331
  0xffffffffbc79a314:netlink_sendmsg+0x214                              ; net/netlink/af_netlink.c:1901
  ...

tp_btf indeed performs better than tracepoint: there are 3 levels of function calls from __tcf_qdisc_find to the tp__netlink_extack bpf program, one function call fewer.

The LBR stack before the raw_tp/tp_btf bpf program is called

$ sudo ./btrace -p 221 --mode entry --output-lbr --limit-events 1 -v
...
2025/03/16 22:33:05 btrace is running..
tp__netlink_extack[bpf] args=((unsigned long long *)ctx=0xffffa0c74a2136c0(0xffffffffbcf7de10)) cpu=0 process=(18217:fentry_fexit-tr)
LBR stack:
[#31] ...
      0xffffffffbc785fc2:qdisc_lookup_rcu+0x62                     (net/sched/sch_api.c:335)              -> 0xffffffffbc787220:__tcf_qdisc_find.part.0+0xc0              (net/sched/cls_api.c:1212)
      0xffffffffbc78722a:__tcf_qdisc_find.part.0+0xca              (net/sched/cls_api.c:1213)             -> 0xffffffffbc7872fc:__tcf_qdisc_find.part.0+0x19c             (net/sched/cls_api.c:1214)
      0xffffffffbc787303:__tcf_qdisc_find.part.0+0x1a3             (net/sched/cls_api.c:1214)             -> 0xffffffffbc7945d0:do_trace_netlink_extack+0x0               (net/netlink/af_netlink.c:155)
      0xffffffffbc7945d5:do_trace_netlink_extack+0x5               (net/netlink/af_netlink.c:155)         -> 0xffffffffbc7945e2:do_trace_netlink_extack+0x12              (include/trace/events/netlink.h:9)
      0xffffffffbc794614:do_trace_netlink_extack+0x44              (include/trace/events/netlink.h:9)     -> 0xffffffffbc7946e0:__bpf_trace_netlink_extack+0x0            (include/trace/events/netlink.h:9)
      0xffffffffbc7946e4:__bpf_trace_netlink_extack+0x4            (include/trace/events/netlink.h:9)     -> 0xffffffffbbaebe90:bpf_trace_run1+0x0                        (kernel/trace/bpf_trace.c:2446)
      0xffffffffbbaebeff:bpf_trace_run1+0x6f                       (kernel/trace/bpf_trace.c:2402)        -> 0xffffffffbb9d3a20:__rcu_read_lock+0x0                       (kernel/rcu/tree_plugin.h:411)
      0xffffffffbb9d3a45:__rcu_read_lock+0x25                      (kernel/rcu/tree_plugin.h:418)         -> 0xffffffffbbaebf04:bpf_trace_run1+0x74                       (arch/x86/include/asm/jump_label.h:27)
      0xffffffffbbaebf12:bpf_trace_run1+0x82                       (include/linux/bpf.h:1243)             -> 0xffffffffc0383650:tp__netlink_extack+0x0                    (tp_btf.c:12)
      0xffffffffc0383650:tp__netlink_extack+0x0                    (tp_btf.c:12)                          -> 0xffffffffc03a1c00:bpf_trampoline_949187772417+0x0
      ...

This time the LBR stack from __tcf_qdisc_find to the tp__netlink_extack bpf program is finally recorded in full: 8 branches in total.

The disassembled machine code before the raw_tp/tp_btf bpf program is called

$ sudo ./btrace -d -k __tcf_qdisc_find.part.0
2025/03/16 21:34:08 Disassembling __tcf_qdisc_find.part.0 at 0xffffffffbc787160 (464 bytes) ..
; __tcf_qdisc_find.part.0+0x0 net/sched/cls_api.c:1187
0xffffffffbc787160: 0f 1f 44 00 00          nopl    (%rax, %rax)
...
; __tcf_qdisc_find.part.0+0xb5 net/sched/cls_api.c:1212
0xffffffffbc787215: 66 31 f6                xorw    %si, %si
0xffffffffbc787218: 48 89 c7                movq    %rax, %rdi
0xffffffffbc78721b: e8 40 ed ff ff          callq   0xffffffffbc785f60  ; qdisc_lookup_rcu+0x0 net/sched/sch_api.c:319
0xffffffffbc787220: 49 89 04 24             movq    %rax, (%r12)
0xffffffffbc787224: 48 89 c3                movq    %rax, %rbx
; __tcf_qdisc_find.part.0+0xc7 net/sched/cls_api.c:1213
0xffffffffbc787227: 48 85 c0                testq   %rax, %rax
0xffffffffbc78722a: 0f 84 cc 00 00 00       je      0xffffffffbc7872fc  ; __tcf_qdisc_find.part.0+0x19c net/sched/cls_api.c:1214
...
; __tcf_qdisc_find.part.0+0x19c net/sched/cls_api.c:1214
0xffffffffbc7872fc: 48 c7 c7 10 de f7 bc    movq    $-0x430821f0, %rdi
0xffffffffbc787303: e8 c8 d2 00 00          callq   0xffffffffbc7945d0  ; do_trace_netlink_extack+0x0 net/netlink/af_netlink.c:155
0xffffffffbc787308: 4d 85 ed                testq   %r13, %r13
0xffffffffbc78730b: 0f 84 74 ff ff ff       je      0xffffffffbc787285  ; __tcf_qdisc_find.part.0+0x125 net/sched/cls_api.c:1215
0xffffffffbc787311: 49 c7 45 00 10 de f7 bc movq    $-0x430821f0, (%r13)
0xffffffffbc787319: e9 67 ff ff ff          jmp     0xffffffffbc787285  ; __tcf_qdisc_find.part.0+0x125 net/sched/cls_api.c:1215
0xffffffffbc78731e: e8 6d 28 25 ff          callq   0xffffffffbb9d9b90  ; __rcu_read_unlock+0x0 kernel/rcu/tree_plugin.h:429
; __tcf_qdisc_find.part.0+0x1c3 net/sched/cls_api.c:1204
0xffffffffbc787323: bb ed ff ff ff          movl    $0xffffffed, %ebx
0xffffffffbc787328: e9 c5 fe ff ff          jmp 0xffffffffbc7871f2      ; __tcf_qdisc_find.part.0+0x92 net/sched/cls_api.c:1260
0xffffffffbc78732d: 0f 1f 00                nopl    (%rax)

$ sudo ./btrace -d -k do_trace_netlink_extack
2025/03/16 22:43:02 Disassembling do_trace_netlink_extack at 0xffffffffbc7945d0 (128 bytes) ..
; do_trace_netlink_extack+0x0 net/netlink/af_netlink.c:155
0xffffffffbc7945d0: 0f 1f 44 00 00          nopl    (%rax, %rax)
...
; do_trace_netlink_extack+0x2e include/trace/events/netlink.h:9
0xffffffffbc7945fe: 48 8b 05 83 db a8 01    movq    0x1a8db83(%rip), %rax
0xffffffffbc794605: 48 85 c0                testq   %rax, %rax
0xffffffffbc794608: 74 0f                   je      0xffffffffbc794619  ; do_trace_netlink_extack+0x49 include/trace/events/netlink.h:9
0xffffffffbc79460a: 48 8b 40 08             movq    8(%rax), %rax
0xffffffffbc79460e: 48 89 fe                movq    %rdi, %rsi
0xffffffffbc794611: 48 89 c7                movq    %rax, %rdi
0xffffffffbc794614: e8 c7 00 00 00          callq   0xffffffffbc7946e0  ; __bpf_trace_netlink_extack+0x0 include/trace/events/netlink.h:9
0xffffffffbc794619: 65 ff 0d a8 16 8a 43    decl    %gs:0x438a16a8(%rip)
...

$ sudo  ./btrace -d -k __bpf_trace_netlink_extack
2025/03/16 22:44:56 Disassembling __bpf_trace_netlink_extack at 0xffffffffbc7946e0 (31 bytes) ..
; __bpf_trace_netlink_extack+0x0 include/trace/events/netlink.h:9
0xffffffffbc7946e0: 55                      pushq   %rbp
0xffffffffbc7946e1: 48 89 e5                movq    %rsp, %rbp
0xffffffffbc7946e4: e8 a7 77 35 ff          callq   0xffffffffbbaebe90  ; bpf_trace_run1+0x0 kernel/trace/bpf_trace.c:2446
0xffffffffbc7946e9: 5d                      popq    %rbp
0xffffffffbc7946ea: 31 f6                   xorl    %esi, %esi
0xffffffffbc7946ec: 31 ff                   xorl    %edi, %edi
0xffffffffbc7946ee: c3                      retq

$ sudo ./btrace -d -k bpf_trace_run1
2025/03/16 22:46:58 Disassembling bpf_trace_run1 at 0xffffffffbbaebe90 (272 bytes) ..
; bpf_trace_run1+0x0 kernel/trace/bpf_trace.c:2446
0xffffffffbbaebe90: 55                      pushq   %rbp
...
; bpf_trace_run1+0x7e include/linux/bpf.h:1243
0xffffffffbbaebf0e: 48 8d 7d d0             leaq    -0x30(%rbp), %rdi
0xffffffffbbaebf12: ff d0                   callq   *%rax
0xffffffffbbaebf14: 0f 1f 00                nopl    (%rax)
...

Hmm, it is still an indirect call, which can be noticeably more expensive than a direct call under Spectre mitigations such as retpolines...

Analyzing the function call stack, LBR stack, and disassembled machine code before a kprobe bpf program is called

Recommended kprobe reading:

Using the example from eBPF Talk: trace kprobe 程序, let's use btrace to look at the function call stack that leads to the bpf program inside the kprobe:

$ sudo ./btrace -p 254 --output-stack --limit-events 1 -v
...
2025/03/16 23:03:57 btrace is running..
k_tcp_connect[bpf] args=((struct pt_regs *)ctx=0xffffa0c7446279c0) retval=(int)0 cpu=11 process=(21237:curl)
Func stack:
  ...
  0xffffffffc038371d:k_tcp_connect+0x5                                  ; tcp-connecting.c:31
  0xffffffffbbaf54da:kprobe_perf_func+0x5a                              ; kernel/trace/trace_kprobe.c:1673
  0xffffffffbbaf57bd:kprobe_dispatcher+0x6d                             ; kernel/trace/trace_kprobe.c:1809
  0xffffffffbb8c0983:kprobe_ftrace_handler+0x153                        ; arch/x86/kernel/kprobes/ftrace.c:45
  0xffffffffc1eb00f5:+0x0
  0xffffffffbc7fe131:tcp_connect+0x1                                    ; net/ipv4/tcp_output.c:4066
  0xffffffffbc82f323:__inet_stream_connect+0x113                        ; net/ipv4/af_inet.c:679
  0xffffffffbc82f62b:inet_stream_connect+0x3b                           ; net/ipv4/af_inet.c:751
  0xffffffffbc6c9bab:__sys_connect_file+0x6b                            ; net/socket.c:2061
  0xffffffffbc6c9c9f:__sys_connect+0xbf                                 ; include/linux/file.h:47
  0xffffffffbc6c9cf8:__x64_sys_connect+0x18                             ; net/socket.c:2085
  0xffffffffbb80b95f:x64_sys_call+0x212f                                ; arch/x86/entry/syscall_64.c:36
  0xffffffffbcabb25e:do_syscall_64+0x7e                                 ; arch/x86/entry/common.c:52
  0xffffffffbcc0012b:entry_SYSCALL_64_after_hwframe+0x76                ; arch/x86/entry/entry_64.S:130

There are 5 levels of function calls from tcp_connect to the k_tcp_connect bpf program.
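
For reference, the k_tcp_connect program might be sketched like this (an assumption; the actual program from the referenced post may differ). A kprobe program receives a struct pt_regs * and pulls the tracee's arguments out of the saved registers:

// Minimal sketch of a kprobe bpf program on tcp_connect.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("kprobe/tcp_connect")
int k_tcp_connect(struct pt_regs *ctx)
{
    /* first argument of tcp_connect(struct sock *sk), read from the registers */
    struct sock *sk = (struct sock *)PT_REGS_PARM1(ctx);

    bpf_printk("tcp_connect: sk=0x%lx", (unsigned long)sk);
    return 0;
}

char _license[] SEC("license") = "GPL";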

The LBR stack before the kprobe bpf program is called

$ sudo ./btrace -p 302 --mode entry --output-lbr --limit-events 1 -v
...
2025/03/16 23:31:32 btrace is running..
k_tcp_connect[bpf] args=((struct pt_regs *)ctx=0xffffa0c743453ba0) cpu=0 process=(23295:curl)
LBR stack:
[#31] ...
      0xffffffffbc7fe130:tcp_connect+0x0                           (net/ipv4/tcp_output.c:4066)           -> 0xffffffffc1eb0000:ftrace_trampoline+0x0
[#30] 0xffffffffc1eb00f0:+0x0                                                                             -> 0xffffffffbb8c0830:kprobe_ftrace_handler+0x0                 (arch/x86/kernel/kprobes/ftrace.c:18)
      0xffffffffbb8c08e4:kprobe_ftrace_handler+0xb4                (arch/x86/kernel/kprobes/ftrace.c:31)  -> 0xffffffffbba69850:get_kprobe+0x0                            (kernel/kprobes.c:377)
      0xffffffffbba69876:get_kprobe+0x26                           (kernel/kprobes.c:382)                 -> 0xffffffffbba69888:get_kprobe+0x38                           (kernel/kprobes.c:384)
      0xffffffffbba69893:get_kprobe+0x43                           (kernel/kprobes.c:389)                 -> 0xffffffffbb8c08e9:kprobe_ftrace_handler+0xb9                (arch/x86/kernel/kprobes/ftrace.c:31)
      0xffffffffbb8c0900:kprobe_ftrace_handler+0xd0                (arch/x86/kernel/kprobes/ftrace.c:36)  -> 0xffffffffbb8c093b:kprobe_ftrace_handler+0x10b               (include/linux/kprobes.h:408)
      0xffffffffbb8c0981:kprobe_ftrace_handler+0x151               (arch/x86/kernel/kprobes/ftrace.c:45)  -> 0xffffffffbbaf5750:kprobe_dispatcher+0x0                     (kernel/trace/trace_kprobe.c:1796)
      0xffffffffbbaf577d:kprobe_dispatcher+0x2d                    (kernel/trace/trace_kprobe.c:1805)     -> 0xffffffffbbaf57b5:kprobe_dispatcher+0x65                    (kernel/trace/trace_kprobe.c:1806)
      0xffffffffbbaf57b8:kprobe_dispatcher+0x68                    (kernel/trace/trace_kprobe.c:1806)     -> 0xffffffffbbaf5480:kprobe_perf_func+0x0                      (kernel/trace/trace_kprobe.c:1654)
      0xffffffffbbaf54d5:kprobe_perf_func+0x55                     (kernel/trace/trace_kprobe.c:1665)     -> 0xffffffffbbaf0ad0:trace_call_bpf+0x0                        (kernel/trace/bpf_trace.c:112)
      0xffffffffbbaf0b0a:trace_call_bpf+0x3a                       (kernel/trace/bpf_trace.c:117)         -> 0xffffffffbb9d3a20:__rcu_read_lock+0x0                       (kernel/rcu/tree_plugin.h:411)
      0xffffffffbb9d3a45:__rcu_read_lock+0x25                      (kernel/rcu/tree_plugin.h:418)         -> 0xffffffffbbaf0b0f:trace_call_bpf+0x3f                       (kernel/trace/bpf_trace.c:147)
      0xffffffffbbaf0b38:trace_call_bpf+0x68                       (include/linux/bpf.h:2099)             -> 0xffffffffbb952360:migrate_disable+0x0                       (kernel/sched/core.c:2232)
      0xffffffffbb952377:migrate_disable+0x17                      (kernel/sched/core.c:2235)             -> 0xffffffffbb95238c:migrate_disable+0x2c                      (kernel/sched/core.c:2237)
      0xffffffffbb9523bb:migrate_disable+0x5b                      (include/linux/preempt.h:480)          -> 0xffffffffbbaf0b3d:trace_call_bpf+0x6d                       (include/linux/bpf.h:2099)
      0xffffffffbbaf0b84:trace_call_bpf+0xb4                       (include/linux/bpf.h:1243)             -> 0xffffffffc0383708:k_tcp_connect+0x0                         (tcp-connecting.c:31)
      0xffffffffc0383708:k_tcp_connect+0x0                         (tcp-connecting.c:31)                  -> 0xffffffffc03a1ac0:bpf_trampoline_1297080123393+0x0
      ...

The LBR stack from tcp_connect to the k_tcp_connect bpf program contains 16 branches.

The disassembled machine code before the kprobe bpf program is called

The author is feeling a bit lazy here; readers are invited to work through it themselves.

Analyzing the function call stack, LBR stack, and disassembled machine code before an fentry/fexit bpf program is called

In fact, btrace itself is built on fentry and fexit. Taking the kprobe program above as the tracee, let's look at the function call stack before the fentry/fexit bpf programs are called:

$ sudo ./btrace -p 254 --output-stack --limit-events 1 -v
2025/03/16 23:03:57 btrace is running..
k_tcp_connect[bpf] args=((struct pt_regs *)ctx=0xffffa0c7446279c0) retval=(int)0 cpu=11 process=(21237:curl)
Func stack:
  0xffffffffc03a34bf:fexit_fn+0x34b                                     ; bpf/btrace.c:106
  0xffffffffc03a34bf:fexit_fn+0x34b                                     ; bpf/btrace.c:106
  0xffffffffc03a1d77:bpf_trampoline_1090921693185+0xb7
  0xffffffffc038371d:k_tcp_connect+0x5                                  ; tcp-connecting.c:31
  ...

There are only 2 levels of function calls from k_tcp_connect to fexit_fn.

Now look at the LBR stack before the fentry/fexit bpf programs are called:

$ sudo ./btrace -p 348 --mode entry --output-lbr --limit-events 1 -v
...
2025/03/16 23:41:51 btrace is running..
k_tcp_connect[bpf] args=((struct pt_regs *)ctx=0xffffa0c7434dbbe0) cpu=0 process=(24048:curl)
LBR stack:
[#31] ...
      0xffffffffc0383708:k_tcp_connect+0x0                         (tcp-connecting.c:31)                  -> 0xffffffffc03a1ac0:bpf_trampoline_1494648619009+0x0
      0xffffffffc03a1af9:bpf_trampoline_1494648619009+0x39                                                -> 0xffffffffbbb6c5c0:__bpf_prog_enter_recur+0x0
      0xffffffffbbb6c5cd:__bpf_prog_enter_recur+0xd                                                       -> 0xffffffffbb9d3a20:__rcu_read_lock+0x0                       (kernel/rcu/tree_plugin.h:411)
      0xffffffffbb9d3a45:__rcu_read_lock+0x25                      (kernel/rcu/tree_plugin.h:418)         -> 0xffffffffbbb6c5d2:__bpf_prog_enter_recur+0x12
      0xffffffffbbb6c5d2:__bpf_prog_enter_recur+0x12                                                      -> 0xffffffffbb952360:migrate_disable+0x0                       (kernel/sched/core.c:2232)
      0xffffffffbb952387:migrate_disable+0x27                      (kernel/sched/core.c:2237)             -> 0xffffffffbbb6c5d7:__bpf_prog_enter_recur+0x17
      0xffffffffbbb6c614:__bpf_prog_enter_recur+0x54                                                      -> 0xffffffffc03a1afe:bpf_trampoline_1494648619009+0x3e
      0xffffffffc03a1b0a:bpf_trampoline_1494648619009+0x4a                                                -> 0xffffffffc03a3160:fentry_fn+0x0                             (bpf/btrace.c:129)
      0xffffffffc03a3200:fentry_fn+0xa0                            (bpf/btrace.c:83)                      -> 0xffffffffbb81e680:intel_pmu_snapshot_arch_branch_stack+0x0  (arch/x86/events/intel/core.c:2359)
      ...

There are 8 branches from k_tcp_connect to fentry_fn.

That is somewhat unreasonable: these 8 branch records serve no purpose, and the author already plans to submit a patch to optimize these useless branch records away.

A quick look at the disassembled machine code before the fentry/fexit bpf programs are called:

$ sudo ./btrace -d -k bpf_trampoline_1494648619009 -B 500
2025/03/16 23:44:18 Disassembling bpf_trampoline_1494648619009 at 0xffffffffc03a1ac0 (183 bytes) ..
; bpf_trampoline_1494648619009+0x0
0xffffffffc03a1ac0: 55                    pushq   %rbp
0xffffffffc03a1ac1: 48 89 e5              movq    %rsp, %rbp
0xffffffffc03a1ac4: 48 83 ec 30           subq    $0x30, %rsp
0xffffffffc03a1ac8: 50                    pushq   %rax
0xffffffffc03a1ac9: 48 89 5d e0           movq    %rbx, -0x20(%rbp)
0xffffffffc03a1acd: b8 01 00 00 00        movl    $1, %eax
0xffffffffc03a1ad2: 48 89 45 f0           movq    %rax, -0x10(%rbp)
0xffffffffc03a1ad6: 48 c7 c0 08 37 38 c0  movq    $-0x3fc7c8f8, %rax
0xffffffffc03a1add: 48 89 45 e8           movq    %rax, -0x18(%rbp)
0xffffffffc03a1ae1: 48 89 7d f8           movq    %rdi, -8(%rbp)
0xffffffffc03a1ae5: 31 ff                 xorl    %edi, %edi
0xffffffffc03a1ae7: 48 89 7d d0           movq    %rdi, -0x30(%rbp)
0xffffffffc03a1aeb: 48 bf 00 70 50 40 c7 a0 ff ff   movabsq   $0xffffa0c740507000, %rdi
0xffffffffc03a1af5: 48 8d 75 d0           leaq    -0x30(%rbp), %rsi
0xffffffffc03a1af9: e8 c2 aa 7c fb        callq   0xffffffffbbb6c5c0    ; __bpf_prog_enter_recur+0x0
0xffffffffc03a1afe: 48 89 c3              movq    %rax, %rbx
0xffffffffc03a1b01: 48 85 c0              testq   %rax, %rax
0xffffffffc03a1b04: 74 09                 je      0xffffffffc03a1b0f    ; bpf_trampoline_1494648619009+0x4f
0xffffffffc03a1b06: 48 8d 7d f8           leaq    -8(%rbp), %rdi
0xffffffffc03a1b0a: e8 51 16 00 00        callq   0xffffffffc03a3160    ; fentry_fn+0x0 bpf/btrace.c:129 [bpf]
0xffffffffc03a1b0f: 48 bf 00 70 50 40 c7 a0 ff ff   movabsq   $0xffffa0c740507000, %rdi
0xffffffffc03a1b19: 48 89 de              movq    %rbx, %rsi
0xffffffffc03a1b1c: 48 8d 55 d0           leaq    -0x30(%rbp), %rdx
0xffffffffc03a1b20: e8 db ad 7c fb        callq   0xffffffffbbb6c900    ; __bpf_prog_exit_recur+0x0
0xffffffffc03a1b25: 48 8b 7d f8           movq    -8(%rbp), %rdi
0xffffffffc03a1b29: 48 8b 85 c8 ff ff ff  movq    -0x38(%rbp), %rax
0xffffffffc03a1b30: 48 8b 5d e0           movq    -0x20(%rbp), %rbx
0xffffffffc03a1b34: c9                    leave
0xffffffffc03a1b35: c3                    retq
0xffffffffc03a1b36: cc                    int3

As you can see, the trampoline calls fentry_fn directly.
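
For comparison, writing your own fentry program against the same kernel function is equally direct; here is a minimal sketch (names are illustrative, not from the article). The bpf trampoline hands the program typed arguments, so there is no pt_regs plumbing at all:

// Minimal sketch of an fentry bpf program on tcp_connect.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("fentry/tcp_connect")
int BPF_PROG(fentry_tcp_connect, struct sock *sk)
{
    bpf_printk("tcp_connect: sk=0x%lx", (unsigned long)sk);
    return 0;
}

char _license[] SEC("license") = "GPL";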

Apart from --output-lbr, which requires CPU support for LBR, and -d, which does not depend on the kernel at all, the remaining features require Linux v5.17+ to work properly.

Summary

btrace offers a simple way to analyze the function call stack, the LBR stack, and the disassembled machine code on the path that calls a bpf program, making it easy for developers to pinpoint issues quickly.

After analyzing the function call stacks, LBR stacks, and disassembled machine code before the bpf programs of tracepoint, raw_tp/tp_btf, kprobe, and fentry/fexit are called, we can see that fentry/fexit has the shortest call stack and LBR stack before its bpf programs run, while tracepoint has the most complex LBR stack.
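
Putting the measurements above side by side:

Mechanism        Call depth to bpf prog    LBR branches to bpf prog    Measured path
tracepoint       4                         > 32 (LBR overflowed)       __tcf_qdisc_find -> tp__netlink_extack
raw_tp/tp_btf    3                         8                           __tcf_qdisc_find -> tp__netlink_extack
kprobe           5                         16                          tcp_connect -> k_tcp_connect
fentry/fexit     2                         8                           k_tcp_connect -> fexit_fn / fentry_fn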

The author's recommendation: use fentry/fexit whenever you can; when a tracepoint is required, prefer raw_tp/tp_btf.