Just for fun, let's statically compare the runtime performance of four bpf dynamic-tracing technologies: tracepoint, raw_tp/tp_btf, kprobe, and fentry/fexit.
What? Yes, a well-known expert already compared the performance of tracepoint and kprobe by debugging the kernel, but that approach is overkill, and not something most people can easily reproduce.
So, now that we live in the bpf era, is there a simple, practical way to compare their runtime performance using bpf itself?
btrace is a modern kernel-function tracing tool for the bpf era: it can capture a tracee's function call stack and its LBR (last branch record) stack, and it can also disassemble the tracee's machine code. This post uses btrace to statically compare the runtime performance of the four technologies above.
P.S. btrace source code: github.com/leonhwangprojects/btrace.
Experiment environment
The test machine is a small Intel-CPU box running Ubuntu 24.10; its CPU supports the LBR feature, as the dmesg output below shows.
```bash
$ dmesg | grep -i lbr
[ 0.101408] Performance Events: XSAVE Architectural LBR, PEBS fmt4+-baseline, AnyThread deprecated, Alderlake Hybrid events, 32-deep LBR, full-width counters, Intel PMU driver.
$ uname -a
Linux bpf-dev 6.11.0-8-generic #8-Ubuntu SMP PREEMPT_DYNAMIC Mon Sep 16 13:41:20 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 24.10
Release: 24.10
Codename: oracular
$ clang --version
Ubuntu clang version 18.1.8 (11)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
$ go version
go version go1.24.0 linux/amd64
```
Analyzing the call stack, LBR stack, and disassembled machine code before a tracepoint bpf program is invoked
Recommended tracepoint learning material:
Take the netlink/netlink_extack tracepoint as an example:
```bash
$ cat /sys/kernel/debug/tracing/events/netlink/netlink_extack/format
name: netlink_extack
ID: 1680
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:__data_loc char[] msg; offset:8; size:4; signed:0;
print fmt: "msg=%s", __get_str(msg)
```
Source code:
```c
// net/netlink/af_netlink.c
void do_trace_netlink_extack(const char *msg)
{
trace_netlink_extack(msg);
}
EXPORT_SYMBOL(do_trace_netlink_extack);
```
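For reference, a minimal tracepoint program for this event might look like the sketch below. This is an assumption, not the post's actual tracepoint.c: the context struct mirrors the type name shown later in btrace's output, and its layout follows the format file above (8 bytes of common_* fields, then the __data_loc msg field).

```c
// SPDX-License-Identifier: GPL-2.0
/* Sketch (assumption): a tracepoint program for netlink/netlink_extack. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct netlink_extack_error_ctx {
	unsigned long long unused; /* common_type/flags/preempt_count/pid */
	unsigned int msg;          /* __data_loc char[]: low 16 bits = offset */
};

SEC("tracepoint/netlink/netlink_extack")
int tp__netlink_extack(struct netlink_extack_error_ctx *ctx)
{
	char buf[64];
	unsigned int off = ctx->msg & 0xffff;

	/* Resolve the __data_loc field to the string payload. */
	bpf_probe_read_kernel_str(buf, sizeof(buf), (void *)ctx + off);
	bpf_printk("netlink extack: %s", buf);
	return 0;
}

char _license[] SEC("license") = "GPL";
```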
Function call stack before the tracepoint bpf program is invoked
Using the example from eBPF Talk: trace tracepoint 程序, let's use btrace to inspect the function call stack through which the tracepoint invokes the bpf program:
```bash
$ sudo ./btrace -p 183 --output-stack --limit-events 1 -v
...
2025/03/16 21:34:53 btrace is running..
tp__netlink_extack[bpf] args=((struct netlink_extack_error_ctx *)ctx=0xffffc0c73f208470) retval=(int)0 cpu=8 process=(14792:fentry_fexit-tr)
Func stack:
...
0xffffffffc03836a1:tp__netlink_extack+0x5 ; tracepoint.c:27
0xffffffffbbbb7873:perf_trace_run_bpf_submit+0x43 ; kernel/events/core.c:10315
0xffffffffbc7952a4:perf_trace_netlink_extack+0x104 ; include/trace/events/netlink.h:9
0xffffffffbc794619:do_trace_netlink_extack+0x49 ; include/trace/events/netlink.h:9
0xffffffffbc787308:__tcf_qdisc_find.part.0+0x1a8 ; net/sched/cls_api.c:1214
0xffffffffbc78c95a:tc_new_tfilter+0x16a ; net/sched/cls_api.c:1195
0xffffffffbc71a400:rtnetlink_rcv_msg+0x380 ; net/core/rtnetlink.c:6638
0xffffffffbc79a872:netlink_rcv_skb+0x52 ; net/netlink/af_netlink.c:2550
0xffffffffbc7180c5:rtnetlink_rcv+0x15 ; net/core/rtnetlink.c:6666
0xffffffffbc799fa5:netlink_unicast+0x245 ; net/netlink/af_netlink.c:1331
0xffffffffbc79a314:netlink_sendmsg+0x214 ; net/netlink/af_netlink.c:1901
...
```
As shown, there are 4 levels of function calls from __tcf_qdisc_find to the tp__netlink_extack bpf program.
LBR stack before the tracepoint bpf program is invoked
```bash
$ sudo ./btrace -p 321 --mode entry --output-lbr --limit-events 1 -v
...
2025/03/16 23:36:20 btrace is running..
tp__netlink_extack[bpf] args=((struct netlink_extack_error_ctx *)ctx=0xffffc0c73ef10ec0) cpu=2 process=(23656:fentry_fexit-tr)
LBR stack:
[#31] 0xffffffffbc9b4937:strlen+0x17 (lib/string.c:402) -> 0xffffffffbc9b4930:strlen+0x10 (lib/string.c:402)
0xffffffffbc9b4937:strlen+0x17 (lib/string.c:402) -> 0xffffffffbc9b4930:strlen+0x10 (lib/string.c:402)
0xffffffffbc9b4937:strlen+0x17 (lib/string.c:402) -> 0xffffffffbc9b4930:strlen+0x10 (lib/string.c:402)
0xffffffffbc9b4937:strlen+0x17 (lib/string.c:402) -> 0xffffffffbc9b4930:strlen+0x10 (lib/string.c:402)
0xffffffffbc9b4937:strlen+0x17 (lib/string.c:402) -> 0xffffffffbc9b4930:strlen+0x10 (lib/string.c:402)
0xffffffffbc9b493e:strlen+0x1e (lib/string.c:404) -> 0xffffffffbc7951e4:perf_trace_netlink_extack+0x44 (include/trace/events/netlink.h:9)
0xffffffffbc795221:perf_trace_netlink_extack+0x81 (include/trace/events/netlink.h:9) -> 0xffffffffbbacbc60:perf_trace_buf_alloc+0x0 (kernel/trace/trace_event_perf.c:393)
0xffffffffbbacbc81:perf_trace_buf_alloc+0x21 (kernel/trace/trace_event_perf.c:404) -> 0xffffffffbbba7350:perf_swevent_get_recursion_context+0x0 (kernel/events/core.c:10000)
0xffffffffbbba73a5:perf_swevent_get_recursion_context+0x55 (kernel/events/core.c:10002) -> 0xffffffffbbacbc86:perf_trace_buf_alloc+0x26 (kernel/trace/trace_event_perf.c:404)
0xffffffffbbacbceb:perf_trace_buf_alloc+0x8b (kernel/trace/trace_event_perf.c:415) -> 0xffffffffbc795226:perf_trace_netlink_extack+0x86 (include/trace/events/netlink.h:9)
0xffffffffbc795276:perf_trace_netlink_extack+0xd6 (include/trace/events/netlink.h:9) -> 0xffffffffbcac4fe0:__memcpy+0x0 (arch/x86/lib/memcpy_64.S:34)
0xffffffffbcac4fea:__memcpy+0xa (arch/x86/lib/memcpy_64.S:39) -> 0xffffffffbc79527b:perf_trace_netlink_extack+0xdb (include/trace/events/netlink.h:9)
0xffffffffbc79529f:perf_trace_netlink_extack+0xff (include/trace/events/netlink.h:9) -> 0xffffffffbbbb7830:perf_trace_run_bpf_submit+0x0 (kernel/events/core.c:10312)
0xffffffffbbbb786e:perf_trace_run_bpf_submit+0x3e (kernel/events/core.c:10315) -> 0xffffffffbbaf0ad0:trace_call_bpf+0x0 (kernel/trace/bpf_trace.c:112)
0xffffffffbbaf0b0a:trace_call_bpf+0x3a (kernel/trace/bpf_trace.c:117) -> 0xffffffffbb9d3a20:__rcu_read_lock+0x0 (kernel/rcu/tree_plugin.h:411)
0xffffffffbb9d3a45:__rcu_read_lock+0x25 (kernel/rcu/tree_plugin.h:418) -> 0xffffffffbbaf0b0f:trace_call_bpf+0x3f (kernel/trace/bpf_trace.c:147)
0xffffffffbbaf0b38:trace_call_bpf+0x68 (include/linux/bpf.h:2099) -> 0xffffffffbb952360:migrate_disable+0x0 (kernel/sched/core.c:2232)
0xffffffffbb952377:migrate_disable+0x17 (kernel/sched/core.c:2235) -> 0xffffffffbb95238c:migrate_disable+0x2c (kernel/sched/core.c:2237)
0xffffffffbb9523bb:migrate_disable+0x5b (include/linux/preempt.h:480) -> 0xffffffffbbaf0b3d:trace_call_bpf+0x6d (include/linux/bpf.h:2099)
0xffffffffbbaf0b84:trace_call_bpf+0xb4 (include/linux/bpf.h:1243) -> 0xffffffffc0383688:tp__netlink_extack+0x0 (tracepoint.c:27)
0xffffffffc0383688:tp__netlink_extack+0x0 (tracepoint.c:27) -> 0xffffffffc03a1dc0:bpf_trampoline_1378684502017+0x0
...
```
Hmm, the LBR stack from __tcf_qdisc_find to the tp__netlink_extack bpf program cannot be recorded in full: too many branches are taken along the way, and the 32-entry LBR cannot hold them all.
Disassembled machine code before the tracepoint bpf program is invoked
btrace can disassemble both kernel functions and bpf programs, so let's look at the machine code along the path that leads to the tp__netlink_extack bpf program:
```bash
$ sudo ./btrace -d -k __tcf_qdisc_find.part.0
2025/03/16 21:34:08 Disassembling __tcf_qdisc_find.part.0 at 0xffffffffbc787160 (464 bytes) ..
; __tcf_qdisc_find.part.0+0x0 net/sched/cls_api.c:1187
0xffffffffbc787160: 0f 1f 44 00 00 nopl (%rax, %rax)
...
; __tcf_qdisc_find.part.0+0xb5 net/sched/cls_api.c:1212
0xffffffffbc787215: 66 31 f6 xorw %si, %si
0xffffffffbc787218: 48 89 c7 movq %rax, %rdi
0xffffffffbc78721b: e8 40 ed ff ff callq 0xffffffffbc785f60 ; qdisc_lookup_rcu+0x0 net/sched/sch_api.c:319
0xffffffffbc787220: 49 89 04 24 movq %rax, (%r12)
0xffffffffbc787224: 48 89 c3 movq %rax, %rbx
; __tcf_qdisc_find.part.0+0xc7 net/sched/cls_api.c:1213
0xffffffffbc787227: 48 85 c0 testq %rax, %rax
0xffffffffbc78722a: 0f 84 cc 00 00 00 je 0xffffffffbc7872fc ; __tcf_qdisc_find.part.0+0x19c net/sched/cls_api.c:1214
...
; __tcf_qdisc_find.part.0+0x19c net/sched/cls_api.c:1214
0xffffffffbc7872fc: 48 c7 c7 10 de f7 bc movq $-0x430821f0, %rdi
0xffffffffbc787303: e8 c8 d2 00 00 callq 0xffffffffbc7945d0 ; do_trace_netlink_extack+0x0 net/netlink/af_netlink.c:155
0xffffffffbc787308: 4d 85 ed testq %r13, %r13
0xffffffffbc78730b: 0f 84 74 ff ff ff je 0xffffffffbc787285 ; __tcf_qdisc_find.part.0+0x125 net/sched/cls_api.c:1215
0xffffffffbc787311: 49 c7 45 00 10 de f7 bc movq $-0x430821f0, (%r13)
0xffffffffbc787319: e9 67 ff ff ff jmp 0xffffffffbc787285 ; __tcf_qdisc_find.part.0+0x125 net/sched/cls_api.c:1215
0xffffffffbc78731e: e8 6d 28 25 ff callq 0xffffffffbb9d9b90 ; __rcu_read_unlock+0x0 kernel/rcu/tree_plugin.h:429
; __tcf_qdisc_find.part.0+0x1c3 net/sched/cls_api.c:1204
0xffffffffbc787323: bb ed ff ff ff movl $0xffffffed, %ebx
0xffffffffbc787328: e9 c5 fe ff ff jmp 0xffffffffbc7871f2 ; __tcf_qdisc_find.part.0+0x92 net/sched/cls_api.c:1260
0xffffffffbc78732d: 0f 1f 00 nopl (%rax)
$ sudo ./btrace -d -k do_trace_netlink_extack
2025/03/16 21:47:23 Disassembling do_trace_netlink_extack at 0xffffffffbc7945d0 (128 bytes) ..
; do_trace_netlink_extack+0x0 net/netlink/af_netlink.c:155
0xffffffffbc7945d0: 0f 1f 44 00 00 nopl (%rax, %rax)
0xffffffffbc7945d5: 66 90 nop
...
0xffffffffbc794614: e8 77 fb ff ff callq 0xffffffffbc794190 ; __traceiter_netlink_extack+0x0 include/trace/events/netlink.h:9
0xffffffffbc794619: 65 ff 0d a8 16 8a 43 decl %gs:0x438a16a8(%rip)
0xffffffffbc794620: 74 17 je 0xffffffffbc794639 ; do_trace_netlink_extack+0x69 include/trace/events/netlink.h:9
; do_trace_netlink_extack+0x52 net/netlink/af_netlink.c:157
0xffffffffbc794622: 5d popq %rbp
0xffffffffbc794623: 31 c0 xorl %eax, %eax
0xffffffffbc794625: 31 f6 xorl %esi, %esi
0xffffffffbc794627: 31 ff xorl %edi, %edi
0xffffffffbc794629: c3 retq
...
$ sudo ./btrace -d -k perf_trace_netlink_extack
2025/03/16 21:51:42 Disassembling perf_trace_netlink_extack at 0xffffffffbc7951a0 (368 bytes) ..
; perf_trace_netlink_extack+0x0 include/trace/events/netlink.h:9
0xffffffffbc7951a0: 55 pushq %rbp
...
0xffffffffbc795298: 4c 8b 4d c8 movq -0x38(%rbp), %r9
0xffffffffbc79529c: 8b 55 c4 movl -0x3c(%rbp), %edx
0xffffffffbc79529f: e8 8c 25 42 ff callq 0xffffffffbbbb7830 ; perf_trace_run_bpf_submit+0x0 kernel/events/core.c:10312
0xffffffffbc7952a4: 59 popq %rcx
...
$ sudo ./btrace -d -k perf_trace_run_bpf_submit
2025/03/16 21:54:09 Disassembling perf_trace_run_bpf_submit at 0xffffffffbbbb7830 (224 bytes) ..
; perf_trace_run_bpf_submit+0x0 kernel/events/core.c:10312
0xffffffffbbbb7830: 0f 1f 44 00 00 nopl (%rax, %rax)
...
; perf_trace_run_bpf_submit+0x38 kernel/events/core.c:10315
0xffffffffbbbb7868: 48 89 fe movq %rdi, %rsi
0xffffffffbbbb786b: 48 89 cf movq %rcx, %rdi
0xffffffffbbbb786e: e8 5d 92 f3 ff callq 0xffffffffbbaf0ad0 ; trace_call_bpf+0x0 kernel/trace/bpf_trace.c:112
0xffffffffbbbb7873: 85 c0 testl %eax, %eax
$ echo ...
```
Analyzing the call stack, LBR stack, and disassembled machine code before a raw_tp/tp_btf bpf program is invoked
Recommended raw_tp/tp_btf learning material:
raw_tp and tp_btf are dynamic-tracing technologies that claim better performance than tracepoint. The difference between them is that tp_btf carries BTF type information; at runtime they behave identically.
Take the netlink/netlink_extack tp_btf as an example; a sketch of both flavors follows.
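Below is a minimal sketch of the two flavors (an assumption for illustration; the program names are made up, and the prototype follows the tracepoint's single const char *msg argument seen in the kernel source above). Both skip the perf_trace_* glue of classic tracepoints, which is where the saved call depth comes from.

```c
// SPDX-License-Identifier: GPL-2.0
/* Sketch (assumption): the same netlink_extack tracepoint attached two ways.
 * raw_tp receives the tracepoint's raw u64 argument array (BPF_PROG casts
 * it), while tp_btf receives BTF-typed arguments the verifier understands. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("raw_tp/netlink_extack")
int BPF_PROG(raw_tp__netlink_extack, const char *msg)
{
	return 0;
}

SEC("tp_btf/netlink_extack")
int BPF_PROG(tp_btf__netlink_extack, const char *msg)
{
	return 0;
}

char _license[] SEC("license") = "GPL";
```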
Function call stack before the raw_tp/tp_btf bpf program is invoked
```bash
$ sudo ./btrace -p 221 --output-stack --limit-events 1 -v
...
2025/03/16 22:26:04 btrace is running..
tp__netlink_extack[bpf] args=((unsigned long long *)ctx=0xffffa0c74a59b7f0(0xffffffffbcf7de10)) retval=(int)0 cpu=4 process=(18217:fentry_fexit-tr)
Func stack:
...
0xffffffffc0383655:tp__netlink_extack+0x5 ; tp_btf.c:12
0xffffffffbc7946e9:__bpf_trace_netlink_extack+0x9 ; include/trace/events/netlink.h:9
0xffffffffbc794619:do_trace_netlink_extack+0x49 ; include/trace/events/netlink.h:9
0xffffffffbc787308:__tcf_qdisc_find.part.0+0x1a8 ; net/sched/cls_api.c:1214
0xffffffffbc78c95a:tc_new_tfilter+0x16a ; net/sched/cls_api.c:1195
0xffffffffbc71a400:rtnetlink_rcv_msg+0x380 ; net/core/rtnetlink.c:6638
0xffffffffbc79a872:netlink_rcv_skb+0x52 ; net/netlink/af_netlink.c:2550
0xffffffffbc7180c5:rtnetlink_rcv+0x15 ; net/core/rtnetlink.c:6666
0xffffffffbc799fa5:netlink_unicast+0x245 ; net/netlink/af_netlink.c:1331
0xffffffffbc79a314:netlink_sendmsg+0x214 ; net/netlink/af_netlink.c:1901
...
```
tp_btf indeed performs better than tracepoint: there are only 3 levels of function calls from __tcf_qdisc_find to the tp__netlink_extack bpf program, one function call fewer.
LBR stack before the raw_tp/tp_btf bpf program is invoked
```bash
$ sudo ./btrace -p 221 --mode entry --output-lbr --limit-events 1 -v
...
2025/03/16 22:33:05 btrace is running..
tp__netlink_extack[bpf] args=((unsigned long long *)ctx=0xffffa0c74a2136c0(0xffffffffbcf7de10)) cpu=0 process=(18217:fentry_fexit-tr)
LBR stack:
[#31] ...
0xffffffffbc785fc2:qdisc_lookup_rcu+0x62 (net/sched/sch_api.c:335) -> 0xffffffffbc787220:__tcf_qdisc_find.part.0+0xc0 (net/sched/cls_api.c:1212)
0xffffffffbc78722a:__tcf_qdisc_find.part.0+0xca (net/sched/cls_api.c:1213) -> 0xffffffffbc7872fc:__tcf_qdisc_find.part.0+0x19c (net/sched/cls_api.c:1214)
0xffffffffbc787303:__tcf_qdisc_find.part.0+0x1a3 (net/sched/cls_api.c:1214) -> 0xffffffffbc7945d0:do_trace_netlink_extack+0x0 (net/netlink/af_netlink.c:155)
0xffffffffbc7945d5:do_trace_netlink_extack+0x5 (net/netlink/af_netlink.c:155) -> 0xffffffffbc7945e2:do_trace_netlink_extack+0x12 (include/trace/events/netlink.h:9)
0xffffffffbc794614:do_trace_netlink_extack+0x44 (include/trace/events/netlink.h:9) -> 0xffffffffbc7946e0:__bpf_trace_netlink_extack+0x0 (include/trace/events/netlink.h:9)
0xffffffffbc7946e4:__bpf_trace_netlink_extack+0x4 (include/trace/events/netlink.h:9) -> 0xffffffffbbaebe90:bpf_trace_run1+0x0 (kernel/trace/bpf_trace.c:2446)
0xffffffffbbaebeff:bpf_trace_run1+0x6f (kernel/trace/bpf_trace.c:2402) -> 0xffffffffbb9d3a20:__rcu_read_lock+0x0 (kernel/rcu/tree_plugin.h:411)
0xffffffffbb9d3a45:__rcu_read_lock+0x25 (kernel/rcu/tree_plugin.h:418) -> 0xffffffffbbaebf04:bpf_trace_run1+0x74 (arch/x86/include/asm/jump_label.h:27)
0xffffffffbbaebf12:bpf_trace_run1+0x82 (include/linux/bpf.h:1243) -> 0xffffffffc0383650:tp__netlink_extack+0x0 (tp_btf.c:12)
0xffffffffc0383650:tp__netlink_extack+0x0 (tp_btf.c:12) -> 0xffffffffc03a1c00:bpf_trampoline_949187772417+0x0
...
```
This time the LBR stack from __tcf_qdisc_find to the tp__netlink_extack bpf program is recorded in full: 8 branch records in total.
Disassembled machine code before the raw_tp/tp_btf bpf program is invoked
```bash
$ sudo ./btrace -d -k __tcf_qdisc_find.part.0
2025/03/16 21:34:08 Disassembling __tcf_qdisc_find.part.0 at 0xffffffffbc787160 (464 bytes) ..
; __tcf_qdisc_find.part.0+0x0 net/sched/cls_api.c:1187
0xffffffffbc787160: 0f 1f 44 00 00 nopl (%rax, %rax)
...
; __tcf_qdisc_find.part.0+0xb5 net/sched/cls_api.c:1212
0xffffffffbc787215: 66 31 f6 xorw %si, %si
0xffffffffbc787218: 48 89 c7 movq %rax, %rdi
0xffffffffbc78721b: e8 40 ed ff ff callq 0xffffffffbc785f60 ; qdisc_lookup_rcu+0x0 net/sched/sch_api.c:319
0xffffffffbc787220: 49 89 04 24 movq %rax, (%r12)
0xffffffffbc787224: 48 89 c3 movq %rax, %rbx
; __tcf_qdisc_find.part.0+0xc7 net/sched/cls_api.c:1213
0xffffffffbc787227: 48 85 c0 testq %rax, %rax
0xffffffffbc78722a: 0f 84 cc 00 00 00 je 0xffffffffbc7872fc ; __tcf_qdisc_find.part.0+0x19c net/sched/cls_api.c:1214
...
; __tcf_qdisc_find.part.0+0x19c net/sched/cls_api.c:1214
0xffffffffbc7872fc: 48 c7 c7 10 de f7 bc movq $-0x430821f0, %rdi
0xffffffffbc787303: e8 c8 d2 00 00 callq 0xffffffffbc7945d0 ; do_trace_netlink_extack+0x0 net/netlink/af_netlink.c:155
0xffffffffbc787308: 4d 85 ed testq %r13, %r13
0xffffffffbc78730b: 0f 84 74 ff ff ff je 0xffffffffbc787285 ; __tcf_qdisc_find.part.0+0x125 net/sched/cls_api.c:1215
0xffffffffbc787311: 49 c7 45 00 10 de f7 bc movq $-0x430821f0, (%r13)
0xffffffffbc787319: e9 67 ff ff ff jmp 0xffffffffbc787285 ; __tcf_qdisc_find.part.0+0x125 net/sched/cls_api.c:1215
0xffffffffbc78731e: e8 6d 28 25 ff callq 0xffffffffbb9d9b90 ; __rcu_read_unlock+0x0 kernel/rcu/tree_plugin.h:429
; __tcf_qdisc_find.part.0+0x1c3 net/sched/cls_api.c:1204
0xffffffffbc787323: bb ed ff ff ff movl $0xffffffed, %ebx
0xffffffffbc787328: e9 c5 fe ff ff jmp 0xffffffffbc7871f2 ; __tcf_qdisc_find.part.0+0x92 net/sched/cls_api.c:1260
0xffffffffbc78732d: 0f 1f 00 nopl (%rax)
$ sudo ./btrace -d -k do_trace_netlink_extack
2025/03/16 22:43:02 Disassembling do_trace_netlink_extack at 0xffffffffbc7945d0 (128 bytes) ..
; do_trace_netlink_extack+0x0 net/netlink/af_netlink.c:155
0xffffffffbc7945d0: 0f 1f 44 00 00 nopl (%rax, %rax)
...
; do_trace_netlink_extack+0x2e include/trace/events/netlink.h:9
0xffffffffbc7945fe: 48 8b 05 83 db a8 01 movq 0x1a8db83(%rip), %rax
0xffffffffbc794605: 48 85 c0 testq %rax, %rax
0xffffffffbc794608: 74 0f je 0xffffffffbc794619 ; do_trace_netlink_extack+0x49 include/trace/events/netlink.h:9
0xffffffffbc79460a: 48 8b 40 08 movq 8(%rax), %rax
0xffffffffbc79460e: 48 89 fe movq %rdi, %rsi
0xffffffffbc794611: 48 89 c7 movq %rax, %rdi
0xffffffffbc794614: e8 c7 00 00 00 callq 0xffffffffbc7946e0 ; __bpf_trace_netlink_extack+0x0 include/trace/events/netlink.h:9
0xffffffffbc794619: 65 ff 0d a8 16 8a 43 decl %gs:0x438a16a8(%rip)
...
$ sudo ./btrace -d -k __bpf_trace_netlink_extack
2025/03/16 22:44:56 Disassembling __bpf_trace_netlink_extack at 0xffffffffbc7946e0 (31 bytes) ..
; __bpf_trace_netlink_extack+0x0 include/trace/events/netlink.h:9
0xffffffffbc7946e0: 55 pushq %rbp
0xffffffffbc7946e1: 48 89 e5 movq %rsp, %rbp
0xffffffffbc7946e4: e8 a7 77 35 ff callq 0xffffffffbbaebe90 ; bpf_trace_run1+0x0 kernel/trace/bpf_trace.c:2446
0xffffffffbc7946e9: 5d popq %rbp
0xffffffffbc7946ea: 31 f6 xorl %esi, %esi
0xffffffffbc7946ec: 31 ff xorl %edi, %edi
0xffffffffbc7946ee: c3 retq
$ sudo ./btrace -d -k bpf_trace_run1
2025/03/16 22:46:58 Disassembling bpf_trace_run1 at 0xffffffffbbaebe90 (272 bytes) ..
; bpf_trace_run1+0x0 kernel/trace/bpf_trace.c:2446
0xffffffffbbaebe90: 55 pushq %rbp
...
; bpf_trace_run1+0x7e include/linux/bpf.h:1243
0xffffffffbbaebf0e: 48 8d 7d d0 leaq -0x30(%rbp), %rdi
0xffffffffbbaebf12: ff d0 callq *%rax
0xffffffffbbaebf14: 0f 1f 00 nopl (%rax)
...
```
Hmm, the bpf program is still invoked through an indirect call (the callq *%rax at bpf_trace_run1+0x82) ...
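For context, here is a simplified sketch of what that indirect call corresponds to (paraphrased, not verbatim kernel code): the bpf_trace_run1() path ultimately dispatches through the program's function pointer.

```c
/* Paraphrased sketch: the dispatch at the end of the bpf_trace_run1()
 * path is an indirect call through the program's function pointer,
 * which is exactly the `callq *%rax` in the disassembly above. */
struct bpf_insn;
struct bpf_prog {
	unsigned int (*bpf_func)(const void *ctx,
				 const struct bpf_insn *insnsi);
	struct bpf_insn *insnsi;
	/* ... many other fields elided ... */
};

static unsigned int run_prog_sketch(const struct bpf_prog *prog,
				    const void *ctx)
{
	return prog->bpf_func(ctx, prog->insnsi);
}
```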
Analyzing the call stack, LBR stack, and disassembled machine code before a kprobe bpf program is invoked
Recommended kprobe learning material:
Using the example from eBPF Talk: trace kprobe 程序, let's use btrace to inspect the function call stack through which the kprobe invokes the bpf program:
```bash
$ sudo ./btrace -p 254 --output-stack --limit-events 1 -v
...
2025/03/16 23:03:57 btrace is running..
k_tcp_connect[bpf] args=((struct pt_regs *)ctx=0xffffa0c7446279c0) retval=(int)0 cpu=11 process=(21237:curl)
Func stack:
...
0xffffffffc038371d:k_tcp_connect+0x5 ; tcp-connecting.c:31
0xffffffffbbaf54da:kprobe_perf_func+0x5a ; kernel/trace/trace_kprobe.c:1673
0xffffffffbbaf57bd:kprobe_dispatcher+0x6d ; kernel/trace/trace_kprobe.c:1809
0xffffffffbb8c0983:kprobe_ftrace_handler+0x153 ; arch/x86/kernel/kprobes/ftrace.c:45
0xffffffffc1eb00f5:+0x0
0xffffffffbc7fe131:tcp_connect+0x1 ; net/ipv4/tcp_output.c:4066
0xffffffffbc82f323:__inet_stream_connect+0x113 ; net/ipv4/af_inet.c:679
0xffffffffbc82f62b:inet_stream_connect+0x3b ; net/ipv4/af_inet.c:751
0xffffffffbc6c9bab:__sys_connect_file+0x6b ; net/socket.c:2061
0xffffffffbc6c9c9f:__sys_connect+0xbf ; include/linux/file.h:47
0xffffffffbc6c9cf8:__x64_sys_connect+0x18 ; net/socket.c:2085
0xffffffffbb80b95f:x64_sys_call+0x212f ; arch/x86/entry/syscall_64.c:36
0xffffffffbcabb25e:do_syscall_64+0x7e ; arch/x86/entry/common.c:52
0xffffffffbcc0012b:entry_SYSCALL_64_after_hwframe+0x76 ; arch/x86/entry/entry_64.S:130
```
There are 5 levels of function calls from tcp_connect to the k_tcp_connect bpf program; a sketch of such a kprobe program follows.
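A minimal sketch of such a kprobe program (an assumption; the post's actual tcp-connecting.c is not shown here):

```c
// SPDX-License-Identifier: GPL-2.0
/* Sketch (assumption): a kprobe program on tcp_connect(). The BPF_KPROBE
 * macro unpacks arguments from the struct pt_regs *ctx visible in
 * btrace's output above. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("kprobe/tcp_connect")
int BPF_KPROBE(k_tcp_connect, struct sock *sk)
{
	bpf_printk("tcp_connect: sk=%p", sk);
	return 0;
}

char _license[] SEC("license") = "GPL";
```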
LBR stack before the kprobe bpf program is invoked
```bash
$ sudo ./btrace -p 302 --mode entry --output-lbr --limit-events 1 -v
...
2025/03/16 23:31:32 btrace is running..
k_tcp_connect[bpf] args=((struct pt_regs *)ctx=0xffffa0c743453ba0) cpu=0 process=(23295:curl)
LBR stack:
[#31] ...
0xffffffffbc7fe130:tcp_connect+0x0 (net/ipv4/tcp_output.c:4066) -> 0xffffffffc1eb0000:ftrace_trampoline+0x0
[#30] 0xffffffffc1eb00f0:+0x0 -> 0xffffffffbb8c0830:kprobe_ftrace_handler+0x0 (arch/x86/kernel/kprobes/ftrace.c:18)
0xffffffffbb8c08e4:kprobe_ftrace_handler+0xb4 (arch/x86/kernel/kprobes/ftrace.c:31) -> 0xffffffffbba69850:get_kprobe+0x0 (kernel/kprobes.c:377)
0xffffffffbba69876:get_kprobe+0x26 (kernel/kprobes.c:382) -> 0xffffffffbba69888:get_kprobe+0x38 (kernel/kprobes.c:384)
0xffffffffbba69893:get_kprobe+0x43 (kernel/kprobes.c:389) -> 0xffffffffbb8c08e9:kprobe_ftrace_handler+0xb9 (arch/x86/kernel/kprobes/ftrace.c:31)
0xffffffffbb8c0900:kprobe_ftrace_handler+0xd0 (arch/x86/kernel/kprobes/ftrace.c:36) -> 0xffffffffbb8c093b:kprobe_ftrace_handler+0x10b (include/linux/kprobes.h:408)
0xffffffffbb8c0981:kprobe_ftrace_handler+0x151 (arch/x86/kernel/kprobes/ftrace.c:45) -> 0xffffffffbbaf5750:kprobe_dispatcher+0x0 (kernel/trace/trace_kprobe.c:1796)
0xffffffffbbaf577d:kprobe_dispatcher+0x2d (kernel/trace/trace_kprobe.c:1805) -> 0xffffffffbbaf57b5:kprobe_dispatcher+0x65 (kernel/trace/trace_kprobe.c:1806)
0xffffffffbbaf57b8:kprobe_dispatcher+0x68 (kernel/trace/trace_kprobe.c:1806) -> 0xffffffffbbaf5480:kprobe_perf_func+0x0 (kernel/trace/trace_kprobe.c:1654)
0xffffffffbbaf54d5:kprobe_perf_func+0x55 (kernel/trace/trace_kprobe.c:1665) -> 0xffffffffbbaf0ad0:trace_call_bpf+0x0 (kernel/trace/bpf_trace.c:112)
0xffffffffbbaf0b0a:trace_call_bpf+0x3a (kernel/trace/bpf_trace.c:117) -> 0xffffffffbb9d3a20:__rcu_read_lock+0x0 (kernel/rcu/tree_plugin.h:411)
0xffffffffbb9d3a45:__rcu_read_lock+0x25 (kernel/rcu/tree_plugin.h:418) -> 0xffffffffbbaf0b0f:trace_call_bpf+0x3f (kernel/trace/bpf_trace.c:147)
0xffffffffbbaf0b38:trace_call_bpf+0x68 (include/linux/bpf.h:2099) -> 0xffffffffbb952360:migrate_disable+0x0 (kernel/sched/core.c:2232)
0xffffffffbb952377:migrate_disable+0x17 (kernel/sched/core.c:2235) -> 0xffffffffbb95238c:migrate_disable+0x2c (kernel/sched/core.c:2237)
0xffffffffbb9523bb:migrate_disable+0x5b (include/linux/preempt.h:480) -> 0xffffffffbbaf0b3d:trace_call_bpf+0x6d (include/linux/bpf.h:2099)
0xffffffffbbaf0b84:trace_call_bpf+0xb4 (include/linux/bpf.h:1243) -> 0xffffffffc0383708:k_tcp_connect+0x0 (tcp-connecting.c:31)
0xffffffffc0383708:k_tcp_connect+0x0 (tcp-connecting.c:31) -> 0xffffffffc03a1ac0:bpf_trampoline_1297080123393+0x0
...
```
The LBR stack from tcp_connect to the k_tcp_connect bpf program contains 16 branch records.
Disassembled machine code before the kprobe bpf program is invoked
The author is feeling a bit lazy here; dear reader, please work through this one yourself.
Analyzing the call stack, LBR stack, and disassembled machine code before a fentry/fexit bpf program is invoked
In fact, btrace itself is built on fentry and fexit. Taking the kprobe program above as the tracee, let's look at the function call stack before the fentry and fexit bpf programs are invoked:
```bash
$ sudo ./btrace -p 254 --output-stack --limit-events 1 -v
2025/03/16 23:03:57 btrace is running..
k_tcp_connect[bpf] args=((struct pt_regs *)ctx=0xffffa0c7446279c0) retval=(int)0 cpu=11 process=(21237:curl)
Func stack:
0xffffffffc03a34bf:fexit_fn+0x34b ; bpf/btrace.c:106
0xffffffffc03a34bf:fexit_fn+0x34b ; bpf/btrace.c:106
0xffffffffc03a1d77:bpf_trampoline_1090921693185+0xb7
0xffffffffc038371d:k_tcp_connect+0x5 ; tcp-connecting.c:31
...
```
There are only 2 levels of function calls from k_tcp_connect to fexit_fn; for comparison, a minimal fentry/fexit sketch is shown below.
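A plain fentry/fexit pair on a kernel function would look like the sketch below (an assumption for illustration; the program names are made up, and note that btrace's own fentry_fn/fexit_fn attach to the tracee bpf program k_tcp_connect rather than to a kernel function):

```c
// SPDX-License-Identifier: GPL-2.0
/* Sketch (assumption): fentry/fexit on tcp_connect(). Arguments are
 * BTF-typed, and fexit additionally sees the return value; both are
 * called straight from the bpf trampoline, hence the short stack. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("fentry/tcp_connect")
int BPF_PROG(fentry_tcp_connect, struct sock *sk)
{
	return 0;
}

SEC("fexit/tcp_connect")
int BPF_PROG(fexit_tcp_connect, struct sock *sk, int retval)
{
	return 0;
}

char _license[] SEC("license") = "GPL";
```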
Next, the LBR stack before the fentry and fexit bpf programs are invoked:
```bash
$ sudo ./btrace -p 348 --mode entry --output-lbr --limit-events 1 -v
...
2025/03/16 23:41:51 btrace is running..
k_tcp_connect[bpf] args=((struct pt_regs *)ctx=0xffffa0c7434dbbe0) cpu=0 process=(24048:curl)
LBR stack:
[#31] ...
0xffffffffc0383708:k_tcp_connect+0x0 (tcp-connecting.c:31) -> 0xffffffffc03a1ac0:bpf_trampoline_1494648619009+0x0
0xffffffffc03a1af9:bpf_trampoline_1494648619009+0x39 -> 0xffffffffbbb6c5c0:__bpf_prog_enter_recur+0x0
0xffffffffbbb6c5cd:__bpf_prog_enter_recur+0xd -> 0xffffffffbb9d3a20:__rcu_read_lock+0x0 (kernel/rcu/tree_plugin.h:411)
0xffffffffbb9d3a45:__rcu_read_lock+0x25 (kernel/rcu/tree_plugin.h:418) -> 0xffffffffbbb6c5d2:__bpf_prog_enter_recur+0x12
0xffffffffbbb6c5d2:__bpf_prog_enter_recur+0x12 -> 0xffffffffbb952360:migrate_disable+0x0 (kernel/sched/core.c:2232)
0xffffffffbb952387:migrate_disable+0x27 (kernel/sched/core.c:2237) -> 0xffffffffbbb6c5d7:__bpf_prog_enter_recur+0x17
0xffffffffbbb6c614:__bpf_prog_enter_recur+0x54 -> 0xffffffffc03a1afe:bpf_trampoline_1494648619009+0x3e
0xffffffffc03a1b0a:bpf_trampoline_1494648619009+0x4a -> 0xffffffffc03a3160:fentry_fn+0x0 (bpf/btrace.c:129)
0xffffffffc03a3200:fentry_fn+0xa0 (bpf/btrace.c:83) -> 0xffffffffbb81e680:intel_pmu_snapshot_arch_branch_stack+0x0 (arch/x86/events/intel/core.c:2359)
...
```
There are 8 branch records from k_tcp_connect to fentry_fn.
That is somewhat unreasonable: these 8 branch records have no reason to exist at all, and the author already plans to submit a patch to optimize away recording them.
Finally, a quick look at the disassembled machine code before the fentry and fexit bpf programs are invoked:
```bash
$ sudo ./btrace -d -k bpf_trampoline_1494648619009 -B 500
2025/03/16 23:44:18 Disassembling bpf_trampoline_1494648619009 at 0xffffffffc03a1ac0 (183 bytes) ..
; bpf_trampoline_1494648619009+0x0
0xffffffffc03a1ac0: 55 pushq %rbp
0xffffffffc03a1ac1: 48 89 e5 movq %rsp, %rbp
0xffffffffc03a1ac4: 48 83 ec 30 subq $0x30, %rsp
0xffffffffc03a1ac8: 50 pushq %rax
0xffffffffc03a1ac9: 48 89 5d e0 movq %rbx, -0x20(%rbp)
0xffffffffc03a1acd: b8 01 00 00 00 movl $1, %eax
0xffffffffc03a1ad2: 48 89 45 f0 movq %rax, -0x10(%rbp)
0xffffffffc03a1ad6: 48 c7 c0 08 37 38 c0 movq $-0x3fc7c8f8, %rax
0xffffffffc03a1add: 48 89 45 e8 movq %rax, -0x18(%rbp)
0xffffffffc03a1ae1: 48 89 7d f8 movq %rdi, -8(%rbp)
0xffffffffc03a1ae5: 31 ff xorl %edi, %edi
0xffffffffc03a1ae7: 48 89 7d d0 movq %rdi, -0x30(%rbp)
0xffffffffc03a1aeb: 48 bf 00 70 50 40 c7 a0 ff ff movabsq $0xffffa0c740507000, %rdi
0xffffffffc03a1af5: 48 8d 75 d0 leaq -0x30(%rbp), %rsi
0xffffffffc03a1af9: e8 c2 aa 7c fb callq 0xffffffffbbb6c5c0 ; __bpf_prog_enter_recur+0x0
0xffffffffc03a1afe: 48 89 c3 movq %rax, %rbx
0xffffffffc03a1b01: 48 85 c0 testq %rax, %rax
0xffffffffc03a1b04: 74 09 je 0xffffffffc03a1b0f ; bpf_trampoline_1494648619009+0x4f
0xffffffffc03a1b06: 48 8d 7d f8 leaq -8(%rbp), %rdi
0xffffffffc03a1b0a: e8 51 16 00 00 callq 0xffffffffc03a3160 ; fentry_fn+0x0 bpf/btrace.c:129 [bpf]
0xffffffffc03a1b0f: 48 bf 00 70 50 40 c7 a0 ff ff movabsq $0xffffa0c740507000, %rdi
0xffffffffc03a1b19: 48 89 de movq %rbx, %rsi
0xffffffffc03a1b1c: 48 8d 55 d0 leaq -0x30(%rbp), %rdx
0xffffffffc03a1b20: e8 db ad 7c fb callq 0xffffffffbbb6c900 ; __bpf_prog_exit_recur+0x0
0xffffffffc03a1b25: 48 8b 7d f8 movq -8(%rbp), %rdi
0xffffffffc03a1b29: 48 8b 85 c8 ff ff ff movq -0x38(%rbp), %rax
0xffffffffc03a1b30: 48 8b 5d e0 movq -0x20(%rbp), %rbx
0xffffffffc03a1b34: c9 leave
0xffffffffc03a1b35: c3 retq
0xffffffffc03a1b36: cc int3
```
As the disassembly shows, the trampoline calls fentry_fn directly.
A note on requirements: --output-lbr needs CPU support for LBR, and -d does not depend on the kernel at all; the remaining features require Linux v5.17+ to work.
Summary
btrace offers a simple way to inspect the function call stack, the LBR stack, and the disassembled machine code leading up to a bpf program's invocation, making it easy for developers to pinpoint problems quickly.
After analyzing all four mechanisms, the picture is clear: fentry/fexit has the simplest call stack and LBR stack before its bpf program is invoked, while tracepoint has the most complex LBR stack, overflowing the 32-entry LBR where tp_btf needed only 8 branch records and kprobe 16.
The author's recommendation: use fentry/fexit whenever possible; when a tracepoint is required, prefer raw_tp/tp_btf.