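While developing the percpu map CPU flags patch set, I ran into two problems with LRU hash maps.

When I tried to fix them, the feedback from the community was: leave them as they are.

It is a pile of legacy code; don't touch it.

If you can avoid LRU hash maps, avoid them.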

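Problem 1: data loss

I had long heard that LRU hash maps lose data.

While writing selftests for percpu map CPU flags, I hit it myself: once an LRU hash map is full, another update_elem() loses data, even when the entry being updated already exists.

What happens inside update_elem()?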

htab_lru_map_update_elem()    // kernel/bpf/hashtab.c
|-->prealloc_lru_pop()        // kernel/bpf/bpf_lru_list.c
    |-->bpf_common_lru_pop_free() {
            node = __local_list_pop_free(loc_l);
            if (!node) {
                bpf_lru_list_pop_free_to_local(lru, loc_l);
                node = __local_list_pop_free(loc_l);
            }
    }
        |-->bpf_lru_list_pop_free_to_local()
            |-->__bpf_lru_list_shrink()
                |-->__bpf_lru_list_shrink_inactive()
                |   |-->lru->del_from_htab()
                |-->lru->del_from_htab()
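
The call chain above suggests why an update of an existing key can still drop data: a free node is popped first, and when none is free the LRU list is shrunk by deleting some entry from the hash table, all before the key is looked up. Below is a minimal user-space sketch of just that ordering. Everything named `toy_*` and the capacity `CAP` are hypothetical; the real kernel code uses per-CPU local free lists and active/inactive lists, which this toy does not model.

```c
#include <assert.h>
#include <stdbool.h>

#define CAP 4 /* toy capacity, an assumption for illustration */

struct toy_lru {
    int keys[CAP];
    int vals[CAP];
    unsigned long stamp[CAP]; /* last-use tick; 0 means the slot is free */
    unsigned long tick;
};

/* Return the slot holding key, or -1 if the key is not in the map. */
static int toy_find(const struct toy_lru *m, int key)
{
    for (int i = 0; i < CAP; i++)
        if (m->stamp[i] && m->keys[i] == key)
            return i;
    return -1;
}

/* Pop a free slot; if none is free, evict the least recently used entry
 * (a stand-in for the shrink + del_from_htab step). */
static int toy_pop_free(struct toy_lru *m)
{
    int victim = 0;

    for (int i = 0; i < CAP; i++) {
        if (!m->stamp[i])
            return i;
        if (m->stamp[i] < m->stamp[victim])
            victim = i;
    }
    m->stamp[victim] = 0; /* the evicted entry is gone from the map */
    return victim;
}

/* Allocate first, look up second: the ordering sketched in the call chain. */
static void toy_update(struct toy_lru *m, int key, int val)
{
    int slot = toy_pop_free(m); /* may evict an unrelated entry */
    int old = toy_find(m, key); /* only now do we learn the key exists */

    assert(slot >= 0 && slot < CAP);
    if (old >= 0)
        m->stamp[old] = 0;      /* the old element is replaced and freed */
    m->keys[slot] = key;
    m->vals[slot] = val;
    m->stamp[slot] = ++m->tick;
}

static bool toy_contains(const struct toy_lru *m, int key)
{
    return toy_find(m, key) >= 0;
}

static int toy_count(const struct toy_lru *m)
{
    int n = 0;

    for (int i = 0; i < CAP; i++)
        n += m->stamp[i] != 0;
    return n;
}
```

In this toy, filling the map with keys 1 to 4 and then updating key 4 again evicts key 1, leaving three entries even though no new key was added.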

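Solution: ???

Never mind. The LRU author did not like the approach I proposed.

Problem 2: deadlock

The community CI, syzbot, and local runs of the LRU test cases can all reproduce the deadlock: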

  [  418.260323] bpf_testmod: oh no, recursing into test_1, recursion_misses 1
  [  424.982201]
  [  424.982207] ================================
  [  424.982216] WARNING: inconsistent lock state
  [  424.982219] 6.18.0-rc1-gbb1b9387787c-dirty #1 Tainted: G        W  OE
  [  424.982221] --------------------------------
  [  424.982223] inconsistent {INITIAL USE} -> {IN-NMI} usage.
  [  424.982225] new_name/11207 [HC1[1]:SC0[0]:HE0:SE1] takes:
  [  424.982229] ffffe8ffffd9c000 (&loc_l->lock){....}-{2:2}, at: bpf_lru_pop_free+0x2c6/0x1a50
  [  424.982244] {INITIAL USE} state was registered at:
  [  424.982246]   lock_acquire+0x154/0x2d0
  [  424.982252]   _raw_spin_lock_irqsave+0x39/0x60
  [  424.982259]   bpf_lru_pop_free+0x2c6/0x1a50
  [  424.982262]   htab_lru_map_update_elem+0x17e/0xa90
  [  424.982266]   bpf_map_update_value+0x5aa/0x1230
  [  424.982272]   __sys_bpf+0x33b4/0x4ef0
  [  424.982275]   __x64_sys_bpf+0x78/0xe0
  [  424.982278]   do_syscall_64+0x6a/0x2f0
  [  424.982282]   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [  424.982287] irq event stamp: 236
  [  424.982288] hardirqs last  enabled at (235): [<ffffffff959e4e70>] do_syscall_64+0x30/0x2f0
  [  424.982292] hardirqs last disabled at (236): [<ffffffff959e65df>] exc_nmi+0x7f/0x110
  [  424.982296] softirqs last  enabled at (0): [<ffffffff933fe7cf>] copy_process+0x1c3f/0x6ab0
  [  424.982302] softirqs last disabled at (0): [<0000000000000000>] 0x0
  [  424.982305]
  [  424.982305] other info that might help us debug this:
  [  424.982306]  Possible unsafe locking scenario:
  [  424.982306]
  [  424.982307]        CPU0
  [  424.982308]        ----
  [  424.982309]   lock(&loc_l->lock);
  [  424.982311]   <Interrupt>
  [  424.982312]     lock(&loc_l->lock);
  [  424.982314]
  [  424.982314]  *** DEADLOCK ***
  [  424.982314]
  [  424.982315] no locks held by new_name/11207.
  [  424.982317]
  [  424.982317] stack backtrace:
  [  424.982326] CPU: 1 UID: 0 PID: 11207 Comm: new_name Tainted: G        W  OE       6.18.0-rc1-gbb1b9387787c-dirty #1 PREEMPT(full)
  [  424.982332] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
  [  424.982334] Hardware name: QEMU Ubuntu 25.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
  [  424.982337] Call Trace:
  [  424.982340]  <NMI>
  [  424.982342]  dump_stack_lvl+0x5d/0x80
  [  424.982356]  print_usage_bug.part.0+0x22b/0x2c0
  [  424.982360]  lock_acquire+0x278/0x2d0
  [  424.982364]  ? __irq_work_queue_local+0x133/0x360
  [  424.982371]  ? bpf_lru_pop_free+0x2c6/0x1a50
  [  424.982375]  _raw_spin_lock_irqsave+0x39/0x60
  [  424.982379]  ? bpf_lru_pop_free+0x2c6/0x1a50
  [  424.982382]  bpf_lru_pop_free+0x2c6/0x1a50
  [  424.982387]  ? arch_irq_work_raise+0x3f/0x60
  [  424.982394]  ? __pfx___irq_work_queue_local+0x10/0x10
  [  424.982399]  htab_lru_map_update_elem+0x17e/0xa90
  [  424.982405]  ? __pfx_htab_lru_map_update_elem+0x10/0x10
  [  424.982408]  ? __kasan_check_byte+0x16/0x60
  [  424.982414]  ? __htab_map_lookup_elem+0x95/0x220
  [  424.982420]  bpf_prog_2c77131b3c031599_oncpu_lru_map+0xe4/0x168
  [  424.982423]  __perf_event_overflow+0x8e8/0xea0
  [  424.982430]  ? __pfx___perf_event_overflow+0x10/0x10
  [  424.982436]  handle_pmi_common+0x3fe/0x810
  [  424.982441]  ? __pfx_handle_pmi_common+0x10/0x10
  [  424.982452]  ? __pfx_intel_bts_interrupt+0x10/0x10
  [  424.982458]  intel_pmu_handle_irq+0x1c5/0x5d0
  [  424.982461]  ? lock_acquire+0x1ef/0x2d0
  [  424.982465]  ? nmi_handle.part.0+0x2f/0x380
  [  424.982469]  perf_event_nmi_handler+0x3e/0x70
  [  424.982476]  nmi_handle.part.0+0x13f/0x380
  [  424.982480]  ? trace_rcu_watching+0x105/0x170
  [  424.982486]  default_do_nmi+0x3b/0x110
  [  424.982490]  ? irqentry_nmi_enter+0x6f/0x80
  [  424.982493]  exc_nmi+0xe3/0x110
  [  424.982497]  end_repeat_nmi+0xf/0x53
  [  424.982502] RIP: 0010:fput_close_sync+0x56/0x1a0
  [  424.982509] Code: 48 89 e5 48 c7 04 24 b3 8a b5 41 48 c7 44 24 08 5c a2 3e 96 48 c1 ed 03 48 c7 44 24 10 10 a7 e0 93 42 c7 44 2d 00 f1 f1 f1 f1 <42> c7 44 2d 04 00 f3 f3 f3 65 48 8b 05 91 98 56 04 48 89 44 24 58
  [  424.982513] RSP: 0018:ffffc900099d7e88 EFLAGS: 00000a06
  [  424.982517] RAX: 0000000000000000 RBX: ffff888109fb48c0 RCX: 0000000000000000
  [  424.982520] RDX: 1ffff110099572bb RSI: 0000000000000008 RDI: ffff888109fb4a20
  [  424.982522] RBP: 1ffff9200133afd1 R08: ffff888109fb48c0 R09: ffff888109278b40
  [  424.982524] R10: ffff888109fb4920 R11: 0000000000000000 R12: 0000000000000003
  [  424.982526] R13: dffffc0000000000 R14: 0000000000000003 R15: 0000000000000000
  [  424.982532]  ? fput_close_sync+0x56/0x1a0
  [  424.982537]  ? fput_close_sync+0x56/0x1a0
  [  424.982541]  </NMI>
  [  424.982542]  <TASK>
  [  424.982544]  ? __pfx_fput_close_sync+0x10/0x10
  [  424.982548]  ? do_raw_spin_unlock+0x59/0x250
  [  424.982553]  __x64_sys_close+0x7d/0xd0
  [  424.982559]  do_syscall_64+0x6a/0x2f0
  [  424.982563]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [  424.982566] RIP: 0033:0x7faae0f88fe2
  [  424.982569] Code: 08 0f 85 71 3a ff ff 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66
  [  424.982571] RSP: 002b:00007ffe58ee5b08 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
  [  424.982574] RAX: ffffffffffffffda RBX: 00007faae0a6cb00 RCX: 00007faae0f88fe2
  [  424.982577] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000072
  [  424.982579] RBP: 00007ffe58ee5b30 R08: 0000000000000000 R09: 0000000000000000
  [  424.982581] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000008
  [  424.982583] R13: 0000000000000000 R14: 0000556f5e250c90 R15: 00007faae11e9000
  [  424.982588]  </TASK>
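
The "Possible unsafe locking scenario" in the report above boils down to this: one path takes `&loc_l->lock` with `_raw_spin_lock_irqsave()`, which masks ordinary interrupts but not NMIs, and a BPF program running in NMI context on the same CPU then tries to take the same lock. Below is a minimal user-space sketch of that re-entrancy. All `toy_*` names are hypothetical, and the try-lock is used only so the sketch terminates; it is not the fix that was proposed to the community.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag toy_lock = ATOMIC_FLAG_INIT; /* stands in for loc_l->lock */
static int toy_nmi_contended;                   /* times the "NMI" found the lock held */

static bool toy_trylock(void)
{
    /* test_and_set returns the previous value: false means we got the lock. */
    return !atomic_flag_test_and_set(&toy_lock);
}

static void toy_unlock(void)
{
    atomic_flag_clear(&toy_lock);
}

/* Stand-in for the BPF program that updates the LRU map from NMI context. */
static void toy_nmi_handler(void)
{
    if (!toy_trylock()) {
        /* A plain spin lock would spin here forever: the holder is the
         * very context this handler interrupted, so it can never unlock. */
        toy_nmi_contended++;
        return;
    }
    toy_unlock();
}

/* Stand-in for the syscall path that takes the lock first. */
static void toy_syscall_path(bool nmi_fires)
{
    bool got = toy_trylock(); /* uncontended, like the {INITIAL USE} acquisition */

    assert(got);
    (void)got;
    if (nmi_fires)
        toy_nmi_handler();    /* the "NMI" lands while the lock is held */
    toy_unlock();
}
```

Calling `toy_syscall_path(true)` bumps `toy_nmi_contended` once, which marks the point where a real spinning lock would hang; with `false`, or when the handler runs on its own, the counter stays unchanged.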

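Note: WARNING: inconsistent lock state

It is only a WARNING, not something that absolutely has to be fixed.

So when I posted my fix to the community, the maintainer replied: "If it's too hard, then leave it as-is." In other words: if it is that hard, don't touch it.

Summary

After these two experiences, I have a much deeper feel for the community's style.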