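Last week, while working on "eBPF Talk: analyzing a bpf FD leak in practice", I found that the cilium/ebpf library on GitHub had no way to get bpf link information from an FD. So I opened an issue to discuss it:

  1. Me: proposed GetLinkInfoFromFD().
  2. Expert: asked why not LoadLinkFromFD().
  3. Me: I don't like how NewProgramFromFD() and NewMapFromFD() clobber the original FD.
  4. Me: proposed NewLinkFromFD() instead, duplicating the FD inside the function with syscall.Dup(fd).
  5. Expert: fine, NewLinkFromFD() can be added, but it has to keep the same semantics as NewProgramFromFD() and NewMapFromFD().
  6. Expert: also, a bpf FD should be duplicated with unix.FcntlInt(fd, unix.F_DUPFD_CLOEXEC, 1), not syscall.Dup(fd).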

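Fair enough. I knew the FD had to be duplicated, but I did not know it should be done with unix.FcntlInt(fd, unix.F_DUPFD_CLOEXEC, 1) rather than syscall.Dup(fd). Lesson learned.

I then opened a PR that implements NewLinkFromFD(), together with a unit test that duplicates a bpf FD with unix.FcntlInt(fd, unix.F_DUPFD_CLOEXEC, 1) and passes the duplicate to NewLinkFromFD().

So what is the difference between these two calls?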

syscall.Dup(fd)

This is Go's wrapper for the dup system call.

Let's go straight to man 2 dup.

# man 2 dup
DUP(2)                                              Linux Programmer's Manual                                             DUP(2)

NAME
       dup, dup2, dup3 - duplicate a file descriptor

SYNOPSIS
       #include <unistd.h>

       int dup(int oldfd);

       ...

DESCRIPTION
       The  dup()  system call creates a copy of the file descriptor oldfd, using the lowest-numbered unused file descriptor for
       the new descriptor.

       After a successful return, the old and new file descriptors may be used interchangeably.  They refer  to  the  same  open
       file description (see open(2)) and thus share file offset and file status flags; for example, if the file offset is modi‐
       fied by using lseek(2) on one of the file descriptors, the offset is also changed for the other.

       The two file descriptors  do  not  share  file  descriptor  flags  (the  close-on-exec  flag).   The  close-on-exec  flag
       (FD_CLOEXEC; see fcntl(2)) for the duplicate descriptor is off.

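The dup system call duplicates the original FD and uses the lowest-numbered unused FD as the new one.

After that, the old and new FDs can be used interchangeably. They refer to the same open file description (the kernel object created by open(2), not the FD itself), and therefore share the file offset and the file status flags.

They do not share FD flags (that is, the close-on-exec flag). The duplicate FD has close-on-exec turned off (for FD_CLOEXEC, see fcntl(2)).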
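The behaviour described above can be checked from Go with a small sketch. It is Linux-only and sticks to the standard library; the helper names (isCloexec, observeDup) and the dupResult type are made up for this illustration and are not part of cilium/ebpf.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"syscall"
)

// dupResult records what was observed after duplicating an FD with dup(2).
type dupResult struct {
	OffsetViaDup int64 // file offset read back through the duplicate
	OrigCloexec  bool  // close-on-exec flag on the original FD
	DupCloexec   bool  // close-on-exec flag on the duplicate
}

// isCloexec reports whether FD_CLOEXEC is set on fd, using fcntl(F_GETFD).
func isCloexec(fd int) (bool, error) {
	flags, _, errno := syscall.Syscall(syscall.SYS_FCNTL, uintptr(fd), uintptr(syscall.F_GETFD), 0)
	if errno != 0 {
		return false, errno
	}
	return flags&syscall.FD_CLOEXEC != 0, nil
}

// observeDup writes through the original FD, then inspects the duplicate.
func observeDup() (dupResult, error) {
	var res dupResult

	f, err := os.CreateTemp("", "dup-demo-*")
	if err != nil {
		return res, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	origFD := int(f.Fd())
	newFD, err := syscall.Dup(origFD)
	if err != nil {
		return res, err
	}
	defer syscall.Close(newFD)

	// Advance the offset by writing 5 bytes through the original FD.
	if _, err := f.WriteString("hello"); err != nil {
		return res, err
	}
	// Both FDs share one open file description, so the duplicate sees the new offset.
	if res.OffsetViaDup, err = syscall.Seek(newFD, 0, io.SeekCurrent); err != nil {
		return res, err
	}
	if res.OrigCloexec, err = isCloexec(origFD); err != nil {
		return res, err
	}
	if res.DupCloexec, err = isCloexec(newFD); err != nil {
		return res, err
	}
	return res, nil
}

func main() {
	res, err := observeDup()
	if err != nil {
		panic(err)
	}
	fmt.Printf("offset via dup: %d, original cloexec: %v, duplicate cloexec: %v\n",
		res.OffsetViaDup, res.OrigCloexec, res.DupCloexec)
}
```

On Linux this should report an offset of 5 through the duplicate. Go's os package opens files with O_CLOEXEC, so the original FD has the flag set, while the FD returned by syscall.Dup does not.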

unix.FcntlInt(fd, unix.F_DUPFD_CLOEXEC, 1)

This is the golang.org/x/sys/unix wrapper for the fcntl system call.

Straight to man 2 fcntl:

# man 2 fcntl
FCNTL(2)                                            Linux Programmer's Manual                                           FCNTL(2)

NAME
       fcntl - manipulate file descriptor

SYNOPSIS
       #include <unistd.h>
       #include <fcntl.h>

       int fcntl(int fd, int cmd, ... /* arg */ );

DESCRIPTION
       ...

   Duplicating a file descriptor
       F_DUPFD (int)
              Duplicate the file descriptor fd using the lowest-numbered available file descriptor greater than or equal to arg.
              This is different from dup2(2), which uses exactly the file descriptor specified.

              On success, the new file descriptor is returned.

              See dup(2) for further details.

       F_DUPFD_CLOEXEC (int; since Linux 2.6.24)
              As  for  F_DUPFD,  but additionally set the close-on-exec flag for the duplicate file descriptor.  Specifying this
              flag permits a program to avoid an additional fcntl() F_SETFD operation to set the FD_CLOEXEC flag.  For an expla‐
              nation of why this flag is useful, see the description of O_CLOEXEC in open(2).

unix.FcntlInt(fd, unix.F_DUPFD_CLOEXEC, 1) duplicates the FD and sets the close-on-exec flag on the duplicate in the same system call.
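A minimal sketch to confirm this from Go. To stay within the standard library it issues the fcntl system call through syscall.Syscall as a stand-in for unix.FcntlInt; the helpers fcntl, dupCloexec, hasCloexec and dupCloexecDemo are hypothetical names for this example, and /dev/null is just a convenient FD to duplicate.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// fcntl is a stdlib-only stand-in for unix.FcntlInt (Linux only).
func fcntl(fd, cmd, arg int) (int, error) {
	r, _, errno := syscall.Syscall(syscall.SYS_FCNTL, uintptr(fd), uintptr(cmd), uintptr(arg))
	if errno != 0 {
		return -1, errno
	}
	return int(r), nil
}

// dupCloexec duplicates fd with F_DUPFD_CLOEXEC, asking for an FD >= 1.
func dupCloexec(fd int) (int, error) {
	return fcntl(fd, syscall.F_DUPFD_CLOEXEC, 1)
}

// hasCloexec reports whether FD_CLOEXEC is set on fd.
func hasCloexec(fd int) (bool, error) {
	flags, err := fcntl(fd, syscall.F_GETFD, 0)
	if err != nil {
		return false, err
	}
	return flags&syscall.FD_CLOEXEC != 0, nil
}

// dupCloexecDemo duplicates an FD for /dev/null and returns the duplicate's flag.
func dupCloexecDemo() (bool, error) {
	f, err := os.Open(os.DevNull)
	if err != nil {
		return false, err
	}
	defer f.Close()

	newFD, err := dupCloexec(int(f.Fd()))
	if err != nil {
		return false, err
	}
	defer syscall.Close(newFD)

	return hasCloexec(newFD)
}

func main() {
	set, err := dupCloexecDemo()
	if err != nil {
		panic(err)
	}
	fmt.Println("close-on-exec set on the duplicate:", set)
}
```

Unlike a dup(2) duplicate, the FD returned here already carries FD_CLOEXEC, with no separate F_SETFD call.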

O_CLOEXEC in open(2):

# man 2 open
       O_CLOEXEC (since Linux 2.6.23)
              Enable the close-on-exec flag for the new file descriptor.  Specifying this flag permits a program to avoid  addi‐
              tional fcntl(2) F_SETFD operations to set the FD_CLOEXEC flag.

              Note  that  the  use  of  this flag is essential in some multithreaded programs, because using a separate fcntl(2)
              F_SETFD operation to set the FD_CLOEXEC flag does not suffice to avoid race conditions where one  thread  opens  a
              file  descriptor and attempts to set its close-on-exec flag using fcntl(2) at the same time as another thread does
              a fork(2) plus execve(2).  Depending on the order of execution, the race may lead to the file descriptor  returned
              by  open()  being  unintentionally  leaked to the program executed by the child process created by fork(2).  (This
              kind of race is in principle possible for any system call that creates a file descriptor whose close-on-exec  flag
              should  be set, and various other Linux system calls provide an equivalent of the O_CLOEXEC flag to deal with this
              problem.)

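The flag exists to close a race in multithreaded programs. If close-on-exec is set in a separate fcntl(2) F_SETFD step after the FD has been created, another thread can fork(2) and execve(2) in between, and the FD leaks into the new program. Setting the flag atomically at creation time removes that window.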

Kernel source code

Having skimmed the docs, let's skim the kernel source of these two system calls as well.

// ${KERNEL}/fs/file.c

SYSCALL_DEFINE1(dup, unsigned int, fildes)
{
    int ret = -EBADF;
    struct file *file = fget_raw(fildes);

    if (file) {
        ret = get_unused_fd_flags(0); // eventually calls alloc_fd(0, nofile, 0)
        if (ret >= 0)
            fd_install(ret, file);
        else
            fput(file);
    }
    return ret;
}

int f_dupfd(unsigned int from, struct file *file, unsigned flags)
{
    unsigned long nofile = rlimit(RLIMIT_NOFILE);
    int err;
    if (from >= nofile)
        return -EINVAL;
    err = alloc_fd(from, nofile, flags);
    if (err >= 0) {
        get_file(file);
        fd_install(err, file);
    }
    return err;
}

// ${KERNEL}/fs/fcntl.c

static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
        struct file *filp)
{
    void __user *argp = (void __user *)arg;
    struct flock flock;
    long err = -EINVAL;

    switch (cmd) {
    case F_DUPFD:
        err = f_dupfd(arg, filp, 0);
        break;
    case F_DUPFD_CLOEXEC:
        err = f_dupfd(arg, filp, O_CLOEXEC);
        break;
    //...
    }
    return err;
}

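The difference between unix.FcntlInt(fd, unix.F_DUPFD_CLOEXEC, 1) and syscall.Dup(fd) ultimately comes down to how alloc_fd() is asked to allocate the new FD:

  1. unix.FcntlInt(fd, unix.F_DUPFD_CLOEXEC, 1) ends up calling alloc_fd(1, nofile, O_CLOEXEC), which allocates the lowest available FD greater than or equal to 1, with the close-on-exec flag set.
  2. syscall.Dup(fd) ends up calling alloc_fd(0, nofile, 0), which allocates the lowest available FD greater than or equal to 0, without the close-on-exec flag.

So there is the answer: apart from the lower bound on the FD number, the difference is whether the close-on-exec flag is set.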

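Summary

From the docs and the kernel source, both approaches work for duplicating a bpf FD (no wonder nothing panicked when I used syscall.Dup(fd)).

Still, to be on the safe side, unix.FcntlInt(fd, unix.F_DUPFD_CLOEXEC, 1) is the recommended one: the duplicate is created with close-on-exec already set, so it is not inherited by a program started through execve(2).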