Mirror of https://github.com/0xAX/linux-insides.git (synced 2024-12-22 14:48:08 +00:00)
Commit de83e8ea3e (parent 0ebe9f97ba): Update interrupts-3.md
@ -1,10 +1,12 @@
|
||||
Interrupts and Interrupt Handling. Part 3.
|
||||
================================================================================
|
||||
|
||||
Interrupt handlers
|
||||
Exception Handling
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the third part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about interrupt and exception handling. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we stopped in the `setup_arch` function from [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blame/master/arch/x86/kernel/setup.c), at the point where handlers are set for the following two exceptions:
|
||||
This is the third part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about interrupt and exception handling in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we stopped at the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blame/master/arch/x86/kernel/setup.c) source code file.
|
||||
|
||||
We already know that this function performs architecture-specific initialization. In our case, `setup_arch` does [x86_64](https://en.wikipedia.org/wiki/X86-64) related initialization. `setup_arch` is a big function, and in the previous part we stopped at the point where it sets handlers for the following two exceptions:
|
||||
|
||||
* `#DB` - debug exception, transfers control from the interrupted process to the debug handler;
|
||||
* `#BP` - breakpoint exception, caused by the `int 3` instruction.
|
||||
@ -22,22 +24,28 @@ void __init early_trap_init(void)
|
||||
}
|
||||
```
|
||||
|
||||
from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We already saw the implementation of the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions in the previous part, and now we will look at the implementation of these early exception handlers.
|
||||
from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We already saw the implementation of the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions in the previous part, and now we will look at the implementation of these two exception handlers.
|
||||
|
||||
Debug and Breakpoint exceptions
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Ok, we set the interrupt gates in the `early_trap_init` function for the `#DB` and `#BP` exceptions, and now it is time to look at their handlers. But first of all let's look at these exceptions. The first exception, `#DB` or debug exception, occurs when a debug event occurs, for example an attempt to change the contents of a [debug register](http://en.wikipedia.org/wiki/X86_debug_register). Debug registers are special registers which are present in processors starting from the [Intel 80386](http://en.wikipedia.org/wiki/Intel_80386), and as you can understand from their name they are used for debugging. These registers allow us to set breakpoints on code and on reads or writes of data, thus tracking down the place of errors. The debug registers are privileged resources, available only to programs running in real-address mode or in protected mode at `CPL` `0`; that's why we have used `set_intr_gate_ist` for the `#DB`, but not the `set_system_intr_gate_ist`. The vector number of the `#DB` exception is `1` (we pass it as `X86_TRAP_DB`) and it has no error code:
|
||||
Ok, we set up exception handlers in the `early_trap_init` function for the `#DB` and `#BP` exceptions, and now it is time to consider their implementations. But before we do this, let's look at the details of these exceptions.
|
||||
|
||||
The first exception, `#DB` or `debug` exception, occurs when a debug event occurs, for example an attempt to change the contents of a [debug register](http://en.wikipedia.org/wiki/X86_debug_register). Debug registers are special registers that were introduced in `x86` processors starting from the [Intel 80386](http://en.wikipedia.org/wiki/Intel_80386) processor, and as you can understand from the name of this CPU extension, the main purpose of these registers is debugging.
|
||||
|
||||
These registers allow us to set breakpoints on code and on reads or writes of data in order to trace it. Debug registers may be accessed only in privileged mode, and an attempt to read or write the debug registers when executing at any other privilege level causes a [general protection fault](https://en.wikipedia.org/wiki/General_protection_fault) exception. That's why we have used `set_intr_gate_ist` for the `#DB` exception, but not the `set_system_intr_gate_ist`.
|
||||
|
||||
The vector number of the `#DB` exception is `1` (we pass it as `X86_TRAP_DB`) and, as we may read in the specification, this exception has no error code:
|
||||
|
||||
```
|
||||
----------------------------------------------------------------------------------------------
|
||||
|Vector|Mnemonic|Description |Type |Error Code|Source |
|
||||
----------------------------------------------------------------------------------------------
|
||||
|1 | #DB |Reserved |F/T |NO | |
|
||||
----------------------------------------------------------------------------------------------
|
||||
+-----------------------------------------------------+
|
||||
|Vector|Mnemonic|Description |Type |Error Code|
|
||||
+-----------------------------------------------------+
|
||||
|1 | #DB |Reserved |F/T |NO |
|
||||
+-----------------------------------------------------+
|
||||
```
|
||||
|
||||
The second one, `#BP` or breakpoint exception, occurs when the processor executes the [INT 3](http://en.wikipedia.org/wiki/INT_%28x86_instruction%29#INT_3) instruction. We can add it anywhere in our code; for example, let's look at a simple program:
|
||||
The second exception, `#BP` or `breakpoint` exception, occurs when the processor executes the [int 3](http://en.wikipedia.org/wiki/INT_%28x86_instruction%29#INT_3) instruction. Unlike the `#DB` exception, the `#BP` exception may be triggered from userspace. We can add it anywhere in our code; for example, let's look at a simple program:
|
||||
|
||||
```C
|
||||
// breakpoint.c
|
||||
@ -94,54 +102,56 @@ Program received signal SIGTRAP, Trace/breakpoint trap.
|
||||
...
|
||||
```
|
||||
|
||||
Now we know a little about these two exceptions and we can move on to consideration of their handlers.
|
||||
At this point we know a little about these two exceptions and can move on to their handlers.
|
||||
|
||||
Preparation before an interrupt handler
|
||||
Preparation before an exception handler
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
As you can note, the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions take the addresses of the exception handlers in the second parameter:
|
||||
As you may have noted before, the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions take the addresses of exception handlers in their second parameter. In our case the two exception handlers will be:
|
||||
|
||||
* `&debug`;
|
||||
* `&int3`.
|
||||
* `debug`;
|
||||
* `int3`.
|
||||
|
||||
You will not find these functions in the C code. All that can be found in the `*.c/*.h` files is the declaration of these functions in [arch/x86/include/asm/traps.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/traps.h):
|
||||
You will not find these functions in the C code. All that can be found in the kernel's `*.c/*.h` files is the declaration of these functions, located in the [arch/x86/include/asm/traps.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/traps.h) kernel header file:
|
||||
|
||||
```C
|
||||
asmlinkage void debug(void);
|
||||
```
|
||||
|
||||
and
|
||||
|
||||
```C
|
||||
asmlinkage void int3(void);
|
||||
```
|
||||
|
||||
But we can see the `asmlinkage` descriptor here. `asmlinkage` is a special specifier for [gcc](http://en.wikipedia.org/wiki/GNU_Compiler_Collection). For `C` functions which are called from assembly, we need an explicit declaration of the function calling convention. In our case, if a function is marked with the `asmlinkage` descriptor, then `gcc` will compile the function to retrieve parameters from the stack. So, both handlers are defined in the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly source code file with the `idtentry` macro:
|
||||
You may note the `asmlinkage` directive in the declarations of these functions. It is a special specifier built on [gcc](http://en.wikipedia.org/wiki/GNU_Compiler_Collection) attributes. For `C` functions which are called from assembly, we need an explicit declaration of the function calling convention. If a function is marked with the `asmlinkage` descriptor, then `gcc` will compile the function to retrieve parameters from the stack (this matters on 32-bit `x86`, where `asmlinkage` expands to `regparm(0)`; on `x86_64` arguments are passed in registers as the ABI requires).
|
||||
|
||||
So, both handlers are defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly source code file with the `idtentry` macro:
|
||||
|
||||
```assembly
|
||||
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
|
||||
```
|
||||
|
||||
and
|
||||
|
||||
```assembly
|
||||
idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
|
||||
```
|
||||
|
||||
Actually `debug` and `int3` are not the interrupt handlers themselves. Remember that before an interrupt/exception handler can execute, some preparation is needed:
|
||||
Each exception handler consists of two parts. The first part is generic and is the same for all exception handlers. It should save the [general purpose registers](https://en.wikipedia.org/wiki/Processor_register) on the stack, switch to the kernel stack if the exception came from userspace, and transfer control to the second part of the exception handler. The second part does the work specific to the given exception. For example, the page fault exception handler should find the virtual page for a given address, the invalid opcode exception handler should send the `SIGILL` [signal](https://en.wikipedia.org/wiki/Unix_signal), and so on.
|
||||
|
||||
* When an interrupt or exception occurred, the processor uses an exception or interrupt vector as an index to a descriptor in the `IDT`;
|
||||
* In legacy mode the `ss:esp` registers are pushed on the stack only if the privilege level changed. In 64-bit mode `ss:rsp` are pushed on the stack every time;
|
||||
* During stack switching with `IST` the new `ss` selector is forced to null. Old `ss` and `rsp` are pushed on the new stack.
|
||||
* The `rflags`, `cs`, `rip` and error code are pushed on the stack;
|
||||
* Control is transferred to an interrupt handler;
|
||||
* After an interrupt handler finishes its work with the `iret` instruction, the old `ss` will be popped from the stack and loaded into the `ss` register.
|
||||
* `ss:rsp` will be popped from the stack unconditionally in the 64-bit mode and will be popped only if there is a privilege level change in legacy mode.
|
||||
* `iret` instruction will restore `rip`, `cs` and `rflags`;
|
||||
* Interrupted program will continue its execution.
|
||||
As we just saw, an exception handler starts from the definition of the `idtentry` macro from the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly source code file, so let's look at the implementation of this macro. As we may see, the `idtentry` macro takes five arguments:
|
||||
|
||||
```
|
||||
+--------------------+
|
||||
+40 | ss |
|
||||
+32 | rsp |
|
||||
+24 | rflags |
|
||||
+16 | cs |
|
||||
+8 | rip |
|
||||
0 | error code |
|
||||
+--------------------+
|
||||
```
|
||||
* `sym` - defines a global symbol with `.globl name` which will be the entry point of the exception handler;
|
||||
* `do_sym` - symbol name which represents a secondary entry of an exception handler;
|
||||
* `has_error_code` - information about existence of an error code of exception.
|
||||
|
||||
Now we can look at the preparations before control is transferred to an interrupt/exception handler from the practical side. As I already wrote above, the first thirteen exception handlers are defined in the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly file with the [idtentry](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S#L967) macro:
|
||||
The last two parameters are optional:
|
||||
|
||||
* `paranoid` - shows us how we need to check current mode (will see explanation in details later);
|
||||
* `shift_ist` - shows us whether an exception handler runs on a stack from the `Interrupt Stack Table`.
|
||||
|
||||
The definition of the `idtentry` macro looks like this:
|
||||
|
||||
```assembly
|
||||
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
|
||||
@ -153,107 +163,203 @@ END(\sym)
|
||||
.endm
|
||||
```
|
||||
|
||||
This macro defines an exception entry point and as we can see it takes `five` arguments:
|
||||
Before we consider the internals of the `idtentry` macro, we should know the state of the stack when an exception occurs. As we may read in the [Intel® 64 and IA-32 Architectures Software Developer’s Manual 3A](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html), the state of the stack when an exception occurs is the following:
|
||||
|
||||
* `sym` - defines global symbol with the `.globl name`.
|
||||
* `do_sym` - an interrupt handler.
|
||||
* `has_error_code:req` - information about error code, The `:req` qualifier tells the assembler that the argument is required;
|
||||
* `paranoid` - shows us how we need to check current mode;
|
||||
* `shift_ist` - shows us which stack to use;
|
||||
```
|
||||
+------------+
|
||||
+40 | %SS |
|
||||
+32 | %RSP |
|
||||
+24 | %RFLAGS |
|
||||
+16 | %CS |
|
||||
+8 | %RIP |
|
||||
0 | ERROR CODE | <-- %RSP
|
||||
+------------+
|
||||
```
|
||||
|
||||
As we can see our exceptions handlers are almost the same:
|
||||
Now we may start to consider the implementation of the `idtentry` macro. Both `#DB` and `#BP` exception handlers are defined as:
|
||||
|
||||
```assembly
|
||||
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
|
||||
idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
|
||||
```
|
||||
|
||||
The differences are only in the global name and the name of the exception handler. Now let's look at how the `idtentry` macro is implemented. It starts with two checks:
|
||||
|
||||
```assembly
|
||||
.if \shift_ist != -1 && \paranoid == 0
|
||||
.error "using shift_ist requires paranoid=1"
|
||||
.endif
|
||||
|
||||
.if \has_error_code
|
||||
XCPT_FRAME
|
||||
.else
|
||||
INTR_FRAME
|
||||
.endif
|
||||
```
|
||||
|
||||
The first check verifies that if an exception uses the `Interrupt stack table` then `paranoid` is set; otherwise it emits an error with the [.error](https://sourceware.org/binutils/docs/as/Error.html#Error) directive. The second `if` clause checks for the existence of an error code and calls the `XCPT_FRAME` or `INTR_FRAME` macro depending on it. These macros just expand to a set of [CFI directives](https://sourceware.org/binutils/docs/as/CFI-directives.html) which are used by `GNU AS` to manage call frames. The `CFI` directives are used only to generate [dwarf2](http://en.wikipedia.org/wiki/DWARF) unwind information for better backtraces and they don't change any code, so we will not go into detail about them, and from this point I will skip all code which is related to these directives. In the next step we check the error code again and push it on the stack if the exception does not provide one:
|
||||
If we look at these definitions, we can see that the assembler will generate two routines named `debug` and `int3`, and that these exception handlers will call the `do_debug` and `do_int3` secondary handlers after some preparation. The third parameter states whether an error code exists, and as we may see neither of our exceptions has one. As the diagram above shows, the processor pushes an error code on the stack only if the exception provides it. This may bring some difficulties, because the stack will look different for exceptions which provide an error code and for exceptions which do not. That's why the implementation of the `idtentry` macro starts by putting a fake error code on the stack if an exception does not provide it:
|
||||
|
||||
```assembly
|
||||
.ifeq \has_error_code
|
||||
pushq_cfi $-1
|
||||
pushq $-1
|
||||
.endif
|
||||
```
|
||||
|
||||
The `pushq_cfi` macro is defined in [arch/x86/include/asm/dwarf2.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/dwarf2.h) and expands to the `pushq` instruction which pushes the given error code:
|
||||
But it is not only a fake error code. The `-1` also represents an invalid system call number, so that the system call restart logic will not be triggered.
|
||||
|
||||
The last two parameters of the `idtentry` macro, `shift_ist` and `paranoid`, tell us whether an exception handler runs on a stack from the `Interrupt Stack Table` or not. You may already know that each kernel thread in the system has its own stack. In addition to these stacks, there are some specialized stacks associated with each processor in the system. One of these stacks is the exception stack. The [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture provides a special feature which is called the `Interrupt Stack Table`. This feature allows switching to a new stack for designated events, such as atomic exceptions like `double fault`. So the `shift_ist` parameter tells us whether we need to switch to an `IST` stack for an exception handler or not.
|
||||
|
||||
The second parameter, `paranoid`, defines the method used to determine whether we came to an exception handler from userspace or not. The easiest way to determine this is via the `CPL` or `Current Privilege Level` stored in the `CS` segment register. If it is equal to `3`, we came from userspace; if it is zero, we came from kernel space:
|
||||
|
||||
```
|
||||
testl $3,CS(%rsp)
|
||||
jnz userspace
|
||||
...
|
||||
...
|
||||
...
|
||||
// we are from the kernel space
|
||||
```
|
||||
|
||||
But unfortunately this method does not give a 100% guarantee. As described in the kernel documentation:
|
||||
|
||||
> if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context,
|
||||
> which might have triggered right after a normal entry wrote CS to the
|
||||
> stack but before we executed SWAPGS, then the only safe way to check
|
||||
> for GS is the slower method: the RDMSR.
|
||||
|
||||
In other words, an `NMI`, for example, could happen inside the critical section around a [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html) instruction. In that case we should check the value of the `MSR_GS_BASE` [model specific register](https://en.wikipedia.org/wiki/Model-specific_register), which stores a pointer to the start of the per-cpu area. So to check whether we came from userspace or not, we check the value of the `MSR_GS_BASE` model specific register: if it is negative we came from kernel space, otherwise we came from userspace:
|
||||
|
||||
```assembly
|
||||
.macro pushq_cfi reg
|
||||
pushq \reg
|
||||
CFI_ADJUST_CFA_OFFSET 8
|
||||
.endm
|
||||
movl $MSR_GS_BASE,%ecx
|
||||
rdmsr
|
||||
testl %edx,%edx
|
||||
js 1f
|
||||
```
|
||||
|
||||
Pay attention to the `$-1`. We already know that when an exception occurs, the processor pushes `ss`, `rsp`, `rflags`, `cs` and `rip` on the stack:
|
||||
In the first two lines of code we read the value of the `MSR_GS_BASE` model specific register into the `edx:eax` pair. We can't set a negative value to the `gs` from userspace. On the other hand, we know that the direct mapping of physical memory starts at the `0xffff880000000000` virtual address. So `MSR_GS_BASE` will contain an address from `0xffff880000000000` to `0xffffc7ffffffffff`. After the `rdmsr` instruction is executed, the smallest possible value in the `%edx` register will be `0xffff8800`, which is `-30720` as a signed 4-byte integer. That's why the kernel space `gs`, which points to the start of the `per-cpu` area, will contain a negative value.
|
||||
|
||||
```C
|
||||
#define RIP 16*8
|
||||
#define CS 17*8
|
||||
#define EFLAGS 18*8
|
||||
#define RSP 19*8
|
||||
#define SS 20*8
|
||||
After we pushed fake error code on the stack, we should allocate space for general purpose registers with:
|
||||
|
||||
```assembly
|
||||
ALLOC_PT_GPREGS_ON_STACK
|
||||
```
|
||||
|
||||
With the `pushq \reg` we denote that place before the `RIP` will contain error code of an exception:
|
||||
|
||||
```C
|
||||
#define ORIG_RAX 15*8
|
||||
```
|
||||
|
||||
The `ORIG_RAX` will contain error code of an exception, [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) number on a hardware interrupt and system call number on [system call](http://en.wikipedia.org/wiki/System_call) entry. In the next step we can see the `ALLOC_PT_GPREGS_ON_STACK` macro which allocates space for the 15 general purpose registers on the stack:
|
||||
macro which is defined in the [arch/x86/entry/calling.h](https://github.com/torvalds/linux/blob/master/arch/x86/entry/calling.h) header file. This macro just allocates 15*8 bytes of space on the stack to preserve general purpose registers:
|
||||
|
||||
```assembly
|
||||
.macro ALLOC_PT_GPREGS_ON_STACK addskip=0
|
||||
subq $15*8+\addskip, %rsp
|
||||
CFI_ADJUST_CFA_OFFSET 15*8+\addskip
|
||||
addq $-(15*8+\addskip), %rsp
|
||||
.endm
|
||||
```
|
||||
|
||||
After this we check `paranoid`, and if it is set we check the two low `CPL` bits of the saved `CS`. We test them against `3`, which tells us whether we came from userspace or not:
|
||||
So the stack will look like this after execution of the `ALLOC_PT_GPREGS_ON_STACK`:
|
||||
|
||||
```
|
||||
+------------+
|
||||
+160 | %SS |
|
||||
+152 | %RSP |
|
||||
+144 | %RFLAGS |
|
||||
+136 | %CS |
|
||||
+128 | %RIP |
|
||||
+120 | ERROR CODE |
|
||||
|------------|
|
||||
+112 | |
|
||||
+104 | |
|
||||
+96 | |
|
||||
+88 | |
|
||||
+80 | |
|
||||
+72 | |
|
||||
+64 | |
|
||||
+56 | |
|
||||
+48 | |
|
||||
+40 | |
|
||||
+32 | |
|
||||
+24 | |
|
||||
+16 | |
|
||||
+8 | |
|
||||
+0 | | <- %RSP
|
||||
+------------+
|
||||
```
|
||||
|
||||
After we have allocated space for general purpose registers, we do some checks to understand whether an exception came from userspace or not; depending on the answer, we should move back to the interrupted process's stack or stay on the exception stack:
|
||||
|
||||
```assembly
|
||||
.if \paranoid
|
||||
.if \paranoid == 1
|
||||
CFI_REMEMBER_STATE
|
||||
testl $3, CS(%rsp)
|
||||
jnz 1f
|
||||
.endif
|
||||
call paranoid_entry
|
||||
.if \paranoid == 1
|
||||
testb $3, CS(%rsp)
|
||||
jnz 1f
|
||||
.endif
|
||||
call paranoid_entry
|
||||
.else
|
||||
call error_entry
|
||||
call error_entry
|
||||
.endif
|
||||
```
|
||||
|
||||
If we came from userspace we jump to the label `1`, which starts with the `call error_entry` instruction. `error_entry` saves all registers in the `pt_regs` structure, which represents an interrupt/exception stack frame and is defined in [arch/x86/include/uapi/asm/ptrace.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/ptrace.h). It saves common and extra registers on the stack with:
|
||||
Let's consider all three of these cases in turn.
|
||||
|
||||
An exception occurred in userspace
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
First, let's consider the case when an exception has `paranoid=1`, like our `debug` and `int3` exceptions. In this case we check the selector from the `CS` segment register and jump to the `1f` label if we came from userspace; otherwise `paranoid_entry` will be called.
|
||||
|
||||
Let's consider the first case, when we came to an exception handler from userspace. As described above, we should jump to the `1` label. The `1` label starts with the call of the
|
||||
|
||||
```assembly
|
||||
call error_entry
|
||||
```
|
||||
|
||||
routine which saves all general purpose registers in the previously allocated area on the stack:
|
||||
|
||||
```assembly
|
||||
SAVE_C_REGS 8
|
||||
SAVE_EXTRA_REGS 8
|
||||
```
|
||||
|
||||
from `rdi` to `r15` and executes the [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html) instruction. This instruction provides a method for the Linux kernel to obtain a pointer to the kernel data structures and save the user's `gsbase`. After this we exit from `error_entry` with the `ret` instruction. After `error_entry` has finished, since we came from userspace we need to switch to the kernel interrupt stack:
|
||||
Both of these macros are defined in the [arch/x86/entry/calling.h](https://github.com/torvalds/linux/blob/master/arch/x86/entry/calling.h) header file and just move the values of general purpose registers to a certain place on the stack, for example:
|
||||
|
||||
```assembly
|
||||
movq %rsp,%rdi
|
||||
call sync_regs
|
||||
.macro SAVE_EXTRA_REGS offset=0
|
||||
movq %r15, 0*8+\offset(%rsp)
|
||||
movq %r14, 1*8+\offset(%rsp)
|
||||
movq %r13, 2*8+\offset(%rsp)
|
||||
movq %r12, 3*8+\offset(%rsp)
|
||||
movq %rbp, 4*8+\offset(%rsp)
|
||||
movq %rbx, 5*8+\offset(%rsp)
|
||||
.endm
|
||||
```
|
||||
|
||||
After `error_entry` has saved all registers, we put the address of the `pt_regs` into `rdi` and call the `sync_regs` function from [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c):
|
||||
After execution of `SAVE_C_REGS` and `SAVE_EXTRA_REGS` the stack will look like this:
|
||||
|
||||
```
|
||||
+------------+
|
||||
+160 | %SS |
|
||||
+152 | %RSP |
|
||||
+144 | %RFLAGS |
|
||||
+136 | %CS |
|
||||
+128 | %RIP |
|
||||
+120 | ERROR CODE |
|
||||
|------------|
|
||||
+112 | %RDI |
|
||||
+104 | %RSI |
|
||||
+96 | %RDX |
|
||||
+88 | %RCX |
|
||||
+80 | %RAX |
|
||||
+72 | %R8 |
|
||||
+64 | %R9 |
|
||||
+56 | %R10 |
|
||||
+48 | %R11 |
|
||||
+40 | %RBX |
|
||||
+32 | %RBP |
|
||||
+24 | %R12 |
|
||||
+16 | %R13 |
|
||||
+8 | %R14 |
|
||||
+0 | %R15 | <- %RSP
|
||||
+------------+
|
||||
```
|
||||
|
||||
After the kernel has saved the general purpose registers on the stack, we should check again that we came from userspace with:
|
||||
|
||||
```assembly
|
||||
testb $3, CS+8(%rsp)
|
||||
jz .Lerror_kernelspace
|
||||
```
|
||||
|
||||
because, as described in the documentation, a fault is possible if a truncated `%RIP` was reported. Anyway, in both cases the [SWAPGS](http://www.felixcloutier.com/x86/SWAPGS.html) instruction will be executed and the values of `MSR_KERNEL_GS_BASE` and `MSR_GS_BASE` will be swapped. From this moment the `%gs` register will point to the base address of kernel structures. So, executing the `SWAPGS` instruction was the main point of the `error_entry` routine.
|
||||
|
||||
Now we can go back to the `idtentry` macro. We may see the following assembler code after the call of `error_entry`:
|
||||
|
||||
```assembly
|
||||
movq %rsp, %rdi
|
||||
call sync_regs
|
||||
```
|
||||
|
||||
Here we put the current value of the stack pointer into the `%rdi` register, which will be the first argument (according to the [x86_64 ABI](https://www.uclibc.org/docs/psABI-x86_64.pdf)) of the `sync_regs` function, and call this function, which is defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) source code file:
|
||||
|
||||
```C
|
||||
asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
|
||||
@ -264,179 +370,125 @@ asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
|
||||
}
|
||||
```
|
||||
|
||||
This function switches off the `IST` stack if we came from usermode. After this we switch to the stack which we got from `sync_regs`:
|
||||
This function takes the result of the `task_pt_regs` macro, which is defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h) header file, stores it in the stack pointer and returns it. The `task_pt_regs` macro expands to an address computed from `thread.sp0`, which represents a pointer to the normal kernel stack:
|
||||
|
||||
```assembly
|
||||
movq %rax,%rsp
|
||||
movq %rsp,%rdi
|
||||
```C
|
||||
#define task_pt_regs(tsk) ((struct pt_regs *)(tsk)->thread.sp0 - 1)
|
||||
```
|
||||
|
||||
and put pointer of the `pt_regs` again in the `rdi`, and in the last step we call an exception handler:
|
||||
As we came from userspace, the exception handler will run in real process context. After we get the stack pointer from `sync_regs`, we switch the stack:
|
||||
|
||||
```assembly
|
||||
call \do_sym
|
||||
movq %rax, %rsp
|
||||
```
|
||||
|
||||
So, the real exception handlers are the `do_debug` and `do_int3` functions. We will see these functions in this part, but a little later. First of all let's look at the preparations before a processor transfers control to an interrupt handler. Otherwise, if `paranoid` is set but is not 1, we call `paranoid_entry`, which does almost the same as `error_entry`, but it checks the current mode in a slower but more accurate way:
|
||||
The last two steps before an exception handler will call secondary handler are:
|
||||
|
||||
1. Passing pointer to `pt_regs` structure which contains preserved general purpose registers to the `%rdi` register:
|
||||
|
||||
```assembly
|
||||
movq %rsp, %rdi
|
||||
```
|
||||
|
||||
as it will be passed as first parameter of secondary exception handler.
|
||||
|
||||
2. Passing the error code in the `%rsi` register, as it will be the second argument of the exception handler, and setting it to `-1` on the stack for the same purpose as before: to prevent the restart of a system call:
|
||||
|
||||
```
|
||||
.if \has_error_code
|
||||
movq ORIG_RAX(%rsp), %rsi
|
||||
movq $-1, ORIG_RAX(%rsp)
|
||||
.else
|
||||
xorl %esi, %esi
|
||||
.endif
|
||||
```
|
||||
|
||||
Additionally, you may see that we zeroed the `%esi` register above in the case when an exception does not provide an error code.
|
||||
|
||||
In the end we just call secondary exception handler:
|
||||
|
||||
```assembly
|
||||
call \do_sym
|
||||
```
|
||||
|
||||
which:
|
||||
|
||||
```C
|
||||
dotraplinkage void do_debug(struct pt_regs *regs, long error_code);
|
||||
```
|
||||
|
||||
will be for `debug` exception and:
|
||||
|
||||
```C
|
||||
dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code);
|
||||
```
|
||||
|
||||
will be for the `int 3` exception. In this part we will not see the implementations of the secondary handlers, because they are very specific, but we will see some of them in one of the next parts.
|
||||
|
||||
We just considered the first case, when an exception occurred in userspace. Let's consider the last two.
|
||||
|
||||
An exception with paranoid > 0 occurred in kernelspace
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
In this case an exception occurred in kernelspace and the `idtentry` macro is defined with `paranoid=1` for this exception. This value of `paranoid` means that we should use the slower way that we saw in the beginning of this part to check whether we really came from kernelspace or not. The `paranoid_entry` routine allows us to know this:
|
||||
|
||||
```assembly
|
||||
ENTRY(paranoid_entry)
|
||||
cld
|
||||
SAVE_C_REGS 8
|
||||
SAVE_EXTRA_REGS 8
|
||||
...
|
||||
...
|
||||
movl $MSR_GS_BASE,%ecx
|
||||
movl $1, %ebx
|
||||
movl $MSR_GS_BASE, %ecx
|
||||
rdmsr
|
||||
testl %edx,%edx
|
||||
js 1f /* negative -> in kernel */
|
||||
testl %edx, %edx
|
||||
js 1f
|
||||
SWAPGS
|
||||
...
|
||||
...
|
||||
ret
|
||||
xorl %ebx, %ebx
|
||||
1: ret
|
||||
END(paranoid_entry)
|
||||
```
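The sign test performed in `paranoid_entry` above can be modeled in C. This is a hedged userspace sketch, not kernel code: `paranoid_entry_model` is a hypothetical name, and the `gs_base` argument stands in for the value that `rdmsr` returns for `MSR_GS_BASE`.

```C
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns the value left in %ebx by the routine above:
 * 1 if no SWAPGS was needed (kernel GS base was already active),
 * 0 if SWAPGS had to be executed. *swapgs_done records the latter.
 */
static int paranoid_entry_model(uint64_t gs_base, int *swapgs_done)
{
	assert(swapgs_done != NULL);

	int ebx = 1;                               /* movl $1, %ebx */
	uint32_t edx = (uint32_t)(gs_base >> 32);  /* rdmsr: high half in %edx */

	*swapgs_done = 0;
	if (edx & 0x80000000u)                     /* testl %edx, %edx; js 1f */
		return ebx;                        /* negative: kernel GS base */

	*swapgs_done = 1;                          /* SWAPGS */
	ebx = 0;                                   /* xorl %ebx, %ebx */
	return ebx;
}
```

The check works because kernel addresses live in the upper half of the address space, so the top bit of the high 32 bits is set only for a kernel GS base.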
|
||||
|
||||
If `edx` is negative, we are in kernel mode. As we have stored all registers on the stack and checked that we are in kernel mode, we need to set up the `IST` stack if it is set for the given exception, call the exception handler and restore the exception stack:
|
||||
As you may see, this function does the same thing that we covered before. We use the second (slow) method to get information about the previous state of an interrupted task. As we checked this and executed `SWAPGS` in case we came from userspace, we should do the same that we did before: we need to put a pointer to a structure which holds the general purpose registers into `%rdi` (which will be the first parameter of a secondary handler) and put the error code, if the exception provides it, into `%rsi` (which will be the second parameter of a secondary handler):
|
||||
|
||||
```assembly
|
||||
.if \shift_ist != -1
|
||||
subq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
|
||||
.endif
|
||||
movq %rsp, %rdi
|
||||
|
||||
call \do_sym
|
||||
|
||||
.if \shift_ist != -1
|
||||
addq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
|
||||
.endif
|
||||
.if \has_error_code
|
||||
movq ORIG_RAX(%rsp), %rsi
|
||||
movq $-1, ORIG_RAX(%rsp)
|
||||
.else
|
||||
xorl %esi, %esi
|
||||
.endif
|
||||
```
|
||||
|
||||
In the last step, when an exception handler finishes its work, all registers will be restored from the stack with the `RESTORE_C_REGS` and `RESTORE_EXTRA_REGS` macros and control will be returned to the interrupted task. That's all. Now we know about the preparation before an interrupt/exception handler starts to execute and we can go directly to the implementation of the handlers.
|
||||
The last step before a secondary handler of an exception is called is the cleanup of the new `IST` stack frame:
|
||||
|
||||
Implementation of interrupts and exceptions handlers
|
||||
```assembly
|
||||
.if \shift_ist != -1
|
||||
subq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
|
||||
.endif
|
||||
```
|
||||
|
||||
You may remember that we passed `shift_ist` as an argument of the `idtentry` macro. Here we check its value and, if it is not equal to `-1`, we get a pointer to a stack from the `Interrupt Stack Table` by the `shift_ist` index and set it up.
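The adjustment of the `IST` entry around the handler call can be sketched in userspace C. Everything here is an assumption made for illustration: `struct fake_tss`, `FAKE_EXCEPTION_STKSZ`, the 1-based indexing of `shift_ist` and the `record_handler` callback are hypothetical and only mimic the `subq`/`call`/`addq` sequence.

```C
#include <assert.h>
#include <stdint.h>

#define FAKE_EXCEPTION_STKSZ 4096u	/* assumed size, illustration only */
#define FAKE_IST_ENTRIES     7

/* Hypothetical stand-in for the IST part of the per-cpu TSS. */
struct fake_tss {
	uint64_t ist[FAKE_IST_ENTRIES];
};

typedef void (*fake_handler_t)(struct fake_tss *tss, int shift_ist);

static uint64_t seen_ist;	/* IST value observed while the handler runs */

static void record_handler(struct fake_tss *tss, int shift_ist)
{
	seen_ist = (shift_ist == -1) ? 0 : tss->ist[shift_ist - 1];
}

static void call_with_shifted_ist(struct fake_tss *tss, int shift_ist,
				  fake_handler_t handler)
{
	if (shift_ist != -1) {
		assert(shift_ist >= 1 && shift_ist <= FAKE_IST_ENTRIES);
		tss->ist[shift_ist - 1] -= FAKE_EXCEPTION_STKSZ; /* subq */
	}

	handler(tss, shift_ist);                                 /* call \do_sym */

	if (shift_ist != -1)
		tss->ist[shift_ist - 1] += FAKE_EXCEPTION_STKSZ; /* addq */
}
```

While the handler runs, the table entry points one stack size lower, so a nested exception that uses the same entry would presumably start on a different stack area; afterwards the entry is restored.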
|
||||
|
||||
At the end of this second way we just call the secondary exception handler as we did before:
|
||||
|
||||
```assembly
|
||||
call \do_sym
|
||||
```
|
||||
|
||||
The last method is similar to the previous two, but the exception occurred with `paranoid=0` and we may use the fast method to determine where we came from.
|
||||
|
||||
Exit from an exception handler
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Both handlers, `do_debug` and `do_int3`, are defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c) source code file and have two things in common. All interrupt/exception handlers are marked with the `dotraplinkage` prefix, which expands to:
|
||||
After the secondary handler finishes its work, we will return to the `idtentry` macro and the next step will be a jump to the `error_exit`:
|
||||
|
||||
```C
|
||||
#define dotraplinkage __visible
|
||||
#define __visible __attribute__((externally_visible))
|
||||
```assembly
|
||||
jmp error_exit
|
||||
```
|
||||
|
||||
which tells the compiler that something else uses this function (in our case these functions are called from the assembly interrupt preparation code). They also take two parameters:
|
||||
|
||||
* pointer to the `pt_regs` structure which contains registers of the interrupted task;
|
||||
* error code.
|
||||
|
||||
First of all let's consider the `do_debug` handler. This function starts by getting the previous state with the `ist_enter` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We call it because we need to know whether we came to the interrupt handler from kernel mode or user mode.
|
||||
|
||||
```C
|
||||
prev_state = ist_enter(regs);
|
||||
```
|
||||
|
||||
The `ist_enter` function returns the previous context state and executes a couple of preparations before we continue to handle an exception. It starts with a check of the previous mode with the `user_mode_vm` macro. It takes the `pt_regs` structure which contains a set of registers of the interrupted task and returns `1` if we came from userspace and `0` if we came from kernel space. According to the previous mode we execute `exception_enter` if we came from userspace or inform [RCU](https://en.wikipedia.org/wiki/Read-copy-update) if we came from kernel space:
|
||||
|
||||
```C
|
||||
...
|
||||
if (user_mode_vm(regs)) {
|
||||
prev_state = exception_enter();
|
||||
} else {
|
||||
rcu_nmi_enter();
|
||||
prev_state = IN_KERNEL;
|
||||
}
|
||||
...
|
||||
...
|
||||
...
|
||||
return prev_state;
|
||||
```
|
||||
|
||||
After this we load the `DR6` debug register into the `dr6` variable with a call to the `get_debugreg` macro from the [arch/x86/include/asm/debugreg.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/debugreg.h):
|
||||
|
||||
```C
|
||||
get_debugreg(dr6, 6);
|
||||
dr6 &= ~DR6_RESERVED;
|
||||
```
|
||||
|
||||
The `DR6` debug register is the debug status register; it contains information about the reason for stopping in the `#DB` or debug exception handler. After we have loaded its value into the `dr6` variable we filter out all reserved bits (`4:12` bits). In the next step we check the `dr6` register and the previous state with the following `if` condition expression:
|
||||
|
||||
```C
|
||||
if (!dr6 && user_mode_vm(regs))
|
||||
user_icebp = 1;
|
||||
```
|
||||
|
||||
If `dr6` does not show any reason why we caught this trap, we set `user_icebp` to one, which means that user code wants to get the [SIGTRAP](https://en.wikipedia.org/wiki/Unix_signal#SIGTRAP) signal. In the next step we check whether it was a [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt) trap and, if so, we go to exit:
|
||||
|
||||
```C
|
||||
if ((dr6 & DR_STEP) && kmemcheck_trap(regs))
|
||||
goto exit;
|
||||
```
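The `dr6` filtering and the `user_icebp` decision described above can be modeled as a small userspace C function. This is a sketch under assumptions: the `DR6_RESERVED` and `DR_STEP` values are written from memory of the kernel's debug register header and should be verified there, and `classify_dr6` and `struct dr6_decision` are hypothetical names.

```C
#include <assert.h>
#include <stdint.h>

#define DR6_RESERVED 0xFFFF0FF0u /* assumed value, check debugreg.h */
#define DR_STEP      0x4000u     /* assumed value: single-step bit  */

struct dr6_decision {
	uint32_t dr6;        /* DR6 with the reserved bits filtered out */
	int      user_icebp; /* 1: no reason recorded, came from user   */
};

static struct dr6_decision classify_dr6(uint32_t raw_dr6, int from_user_mode)
{
	struct dr6_decision d;

	assert(from_user_mode == 0 || from_user_mode == 1);

	d.dr6 = raw_dr6 & ~DR6_RESERVED;              /* dr6 &= ~DR6_RESERVED */
	d.user_icebp = (!d.dr6 && from_user_mode);    /* if (!dr6 && user...) */
	return d;
}
```

A caller could then test `d.dr6 & DR_STEP` in the same way the handler does before the `kmemcheck_trap` call.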
|
||||
|
||||
After we have done all these checks, we clear the `dr6` register, clear the `DEBUGCTLMSR_BTF` flag which provides single-step on branches debugging, set the `dr6` register for the current thread and increase the `debug_stack_usage` [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable with:
|
||||
|
||||
```C
|
||||
set_debugreg(0, 6);
|
||||
clear_tsk_thread_flag(tsk, TIF_BLOCKSTEP);
|
||||
tsk->thread.debugreg6 = dr6;
|
||||
debug_stack_usage_inc();
|
||||
```
|
||||
|
||||
As we saved `dr6`, we can allow irqs:
|
||||
|
||||
```C
|
||||
static inline void preempt_conditional_sti(struct pt_regs *regs)
|
||||
{
|
||||
preempt_count_inc();
|
||||
if (regs->flags & X86_EFLAGS_IF)
|
||||
local_irq_enable();
|
||||
}
|
||||
```
|
||||
|
||||
You can read more about `local_irq_enable` and related stuff in the second part about [interrupts handling in the Linux kernel](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-2.html). In the next step we check whether the previous mode was [virtual 8086](https://en.wikipedia.org/wiki/Virtual_8086_mode) and handle the trap:
|
||||
|
||||
```C
|
||||
if (regs->flags & X86_VM_MASK) {
|
||||
handle_vm86_trap((struct kernel_vm86_regs *) regs, error_code, X86_TRAP_DB);
|
||||
preempt_conditional_cli(regs);
|
||||
debug_stack_usage_dec();
|
||||
goto exit;
|
||||
}
|
||||
...
|
||||
...
|
||||
...
|
||||
exit:
|
||||
ist_exit(regs, prev_state);
|
||||
```
|
||||
|
||||
If we did not come from virtual 8086 mode, we need to check the `dr6` register and the previous mode as we did above. Here we check whether step mode debugging is enabled and whether we are not from user mode; if so, we clear the step bit in the `dr6` copy of the current thread, set the `TIF_SINGLESTEP` flag and clear the [Trap flag](https://en.wikipedia.org/wiki/Trap_flag) in the saved flags:
|
||||
|
||||
```C
|
||||
if ((dr6 & DR_STEP) && !user_mode(regs)) {
|
||||
tsk->thread.debugreg6 &= ~DR_STEP;
|
||||
set_tsk_thread_flag(tsk, TIF_SINGLESTEP);
|
||||
regs->flags &= ~X86_EFLAGS_TF;
|
||||
}
|
||||
```
|
||||
|
||||
Then we get the `SIGTRAP` signal code:
|
||||
|
||||
```C
|
||||
si_code = get_si_code(tsk->thread.debugreg6);
|
||||
```
|
||||
|
||||
and send it for user icebp traps:
|
||||
|
||||
```C
|
||||
if (tsk->thread.debugreg6 & (DR_STEP | DR_TRAP_BITS) || user_icebp)
|
||||
send_sigtrap(tsk, regs, error_code, si_code);
|
||||
preempt_conditional_cli(regs);
|
||||
debug_stack_usage_dec();
|
||||
exit:
|
||||
ist_exit(regs, prev_state);
|
||||
```
|
||||
|
||||
In the end we disable `irqs`, decrease the value of `debug_stack_usage` and exit from the exception handler with the `ist_exit` function.
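The pairing of the conditional interrupt enable at the start of the handler with the conditional disable at the end can be sketched in userspace C. This is a simplified model, not the kernel implementation: `struct fake_cpu`, `struct fake_regs` and both functions are hypothetical, and the body of the `cli` counterpart is assumed to mirror the `preempt_conditional_sti` function shown earlier.

```C
#include <assert.h>
#include <stdint.h>

#define X86_EFLAGS_IF 0x200u	/* interrupt enable flag, bit 9 of EFLAGS */

/* Hypothetical model of the per-cpu state touched by the helpers. */
struct fake_cpu {
	int preempt_count;
	int irqs_enabled;
};

struct fake_regs {
	uint64_t flags;		/* saved EFLAGS of the interrupted task */
};

static void fake_conditional_sti(struct fake_cpu *cpu,
				 const struct fake_regs *regs)
{
	cpu->preempt_count++;			/* preempt_count_inc() */
	if (regs->flags & X86_EFLAGS_IF)
		cpu->irqs_enabled = 1;		/* local_irq_enable() */
}

static void fake_conditional_cli(struct fake_cpu *cpu,
				 const struct fake_regs *regs)
{
	if (regs->flags & X86_EFLAGS_IF)
		cpu->irqs_enabled = 0;		/* assumed: local_irq_disable() */
	assert(cpu->preempt_count > 0);
	cpu->preempt_count--;			/* assumed: preempt_count_dec() */
}
```

Interrupts are only re-enabled inside the handler if they were enabled in the interrupted context, and the preemption counter is balanced on the way out.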
|
||||
|
||||
The second exception handler is `do_int3`, defined in the same source code file - [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). In `do_int3` we do almost the same as in the `do_debug` handler. We get the previous state with `ist_enter`, increase and decrease the `debug_stack_usage` per-cpu variable, enable and disable local interrupts. But of course there is one difference between these two handlers. We need to lock and then sync processor cores during breakpoint patching.
|
||||
routine. The `error_exit` function is defined in the same [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly source code file. The main goal of this function is to find out where we came from (userspace or kernelspace) and execute `SWAPGS` depending on this, then restore the registers to their previous state and execute the `iret` instruction to transfer control back to the interrupted task.
|
||||
|
||||
That's all.
|
||||
|
||||
|