This is the third part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about an interrupts and an exceptions handling and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we stoped in the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blame/master/arch/x86/kernel/setup.c) on the setting of the two exceptions handlers for the two following exceptions:
*`#DB` - debug exception, transfers control from the interrupted process to the debug handler;
*`#BP` - breakpoint exception, caused by the `int 3` instruction.
These exceptions allow the `x86_64` architecture to have early exception processing for the purpose of debugging via the [kgdb](https://en.wikipedia.org/wiki/KGDB).
As you can remember we set these exceptions handlers in the `early_trap_init` function:
from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We already saw implementation of the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions in the previous part and now we will look on the implementation of these early exceptions handlers.
Ok, we set the interrupts gates in the `early_trap_init` function for the `#DB` and `#BP` exceptions and now time is to look on their handlers. But first of all let's look on these exceptions. The first exceptions - `#DB` or debug exception occurs when a debug event occurs, for example attempt to change the contents of a [debug register](http://en.wikipedia.org/wiki/X86_debug_register). Debug registers are special registers which present in processors starting from the [Intel 80386](http://en.wikipedia.org/wiki/Intel_80386) and as you can understand from its name they are used for debugging. These registers allow to set breakpoints on the code and read or write data to trace, thus tracking the place of errors. The debug registers are privileged resources available and the program in either real-address or protected mode at `CPL` is `0`, that's why we have used `set_intr_gate_ist` for the `#DB`, but not the `set_system_intr_gate_ist`. The verctor number of the `#DB` exceptions is `1` (we pass it as `X86_TRAP_DB`) and has no error code:
The second is `#BP` or breakpoint exception occurs when processor executes the [INT 3](http://en.wikipedia.org/wiki/INT_%28x86_instruction%29#INT_3) instruction. We can add it anywhere in our code, for example let's look on the simple program:
```C
// breakpoint.c
#include <stdio.h>
int main() {
int i;
while (i <6){
printf("i equal to: %d\n", i);
__asm__("int3");
++i;
}
}
```
If we will compile and run this program, we will see following output:
```
$ gcc breakpoint.c -o breakpoint
i equal to: 0
Trace/breakpoint trap
```
But if will run it with gdb, we will see our breakpoint and can continue execution of our program:
```
$ gdb breakpoint
...
...
...
(gdb) run
Starting program: /home/alex/breakpoints
i equal to: 0
Program received signal SIGTRAP, Trace/breakpoint trap.
As you can note, the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions takes an addresses of the exceptions handlers in the second parameter:
*`&debug`;
*`&int3`.
You will not find these functions in the C code. All that can be found in in the `*.c/*.h` files only definition of this functions in the [arch/x86/include/asm/traps.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/traps.h):
```C
asmlinkage void debug(void);
asmlinkage void int3(void);
```
But we can see `asmlinkage` descriptor here. The `asmlinkage` is the special specificator of the [gcc](http://en.wikipedia.org/wiki/GNU_Compiler_Collection). Actually for a `C` functions which are will be called from assembly, we need in explicit declaration of the function calling convention. In our case, if function maked with `asmlinkage` descriptor, then `gcc` will compile the function to retrieve parameters from stack. So, both handlers are defined in the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly source code file with the `idtentry` macro:
Actually `debug` and `int3` are not interrupts handlers. Remember that before we can execute an interrupt/exception handler, we need to do some preparations as:
* When an interrupt or exception occured, the processor uses an exception or interrupt vector as an index to a descriptor in the `IDT`;
* In legacy mode `ss:esp` registers are pushed on the stack only if privilege level changed. In 64-bit mode `ss:rsp` pushed on the stack everytime;
* During stack switching with `IST` the new `ss` selector is forced to null. Old `ss` and `rsp` are pushed on the new stack.
* The `rflags`, `cs`, `rip` and error code pushed on the stack;
* Control transfered to an interrupt handler;
* After an interrupt handler will finish its work and finishes with the `iret` instruction, old `ss` will be poped from the stack and loaded to the `ss` register.
*`ss:rsp` will be popped from the stack unconditionally in the 64-bit mode and will be popped only if there is a privilege level change in legacy mode.
*`iret` instruction will restore `rip`, `cs` and `rflags`;
* Interrupted program will continue its execution.
```
+--------------------+
+40 | ss |
+32 | rsp |
+24 | rflags |
+16 | cs |
+8 | rip |
0 | error code |
+--------------------+
```
Now we can see on the preparations before a process will transfer control to an interrupt/exception handler from practical side. As I already wrote above the first thirteen exceptions handlers defined in the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly file with the [idtentry](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S#L967) macro:
The differences are only in the global name and name of exceptions handlers. Now let's look how `idtentry` macro implemented. It starts from the two checks:
```assembly
.if \shift_ist != -1 && \paranoid == 0
.error "using shift_ist requires paranoid=1"
.endif
.if \has_error_code
XCPT_FRAME
.else
INTR_FRAME
.endif
```
First check makes the check that an exceptions uses `Interrupt stack table` and `paranoid` is set, in other way it emits the erorr with the [.error](https://sourceware.org/binutils/docs/as/Error.html#Error) directive. The second `if` clause checks existence of an error code and calls `XCPT_FRAME` or `INTR_FRAME` macros depends on it. These macros just expand to the set of [CFI directives](https://sourceware.org/binutils/docs/as/CFI-directives.html) which are used by `GNU AS` to manage call frames. The `CFI` directives are used only to generate [dwarf2](http://en.wikipedia.org/wiki/DWARF) unwind information for better backtraces and they don't change any code, so we will not go into detail about it and from this point I will skip all code which is related to these directives. In the next step we check error code again and push it on the stack if an exception has it with the:
```assembly
.ifeq \has_error_code
pushq_cfi $-1
.endif
```
The `pushq_cfi` macro defined in the [arch/x86/include/asm/dwarf2.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/dwarf2.h) and expands to the `pushq` instruction which pushes given error code:
```assembly
.macro pushq_cfi reg
pushq \reg
CFI_ADJUST_CFA_OFFSET 8
.endm
```
Pay attention on the `$-1`. We already know that when an exception occrus, the processor pushes `ss`, `rsp`, `rflags`, `cs` and `rip` on the stack:
```C
#define RIP 16*8
#define CS 17*8
#define EFLAGS 18*8
#define RSP 19*8
#define SS 20*8
```
With the `pushq \reg` we denote that place before the `RIP` will contain error code of an exception:
```C
#define ORIG_RAX 15*8
```
The `ORIG_RAX` will contain error code of an exception, [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) number on a hardware interrupt and system call number on [system call](http://en.wikipedia.org/wiki/System_call) entry. In the next step we can see thr `ALLOC_PT_GPREGS_ON_STACK` macro which allocates space for the 15 general purpose registers on the stack:
```assembly
.macro ALLOC_PT_GPREGS_ON_STACK addskip=0
subq $15*8+\addskip, %rsp
CFI_ADJUST_CFA_OFFSET 15*8+\addskip
.endm
```
After this we check `paranoid` and if it is set we check first three `CPL` bits. We compare it with the `3` and it allows us to know did we come from userspace or not:
```assembly
.if \paranoid
.if \paranoid == 1
CFI_REMEMBER_STATE
testl $3, CS(%rsp)
jnz 1f
.endif
call paranoid_entry
.else
call error_entry
.endif
```
If we came from userspace we jump on the label `1` which starts from the `call error_entry` instruction. The `error_entry` saves all registers in the `pt_regs` structure which presetens an interrupt/exception stack frame and defined in the [arch/x86/include/uapi/asm/ptrace.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/ptrace.h). It saves common and extra registers on the stack with the:
```assembly
SAVE_C_REGS 8
SAVE_EXTRA_REGS 8
```
from `rdi` to `r15` and executes [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html) instruction. This instruction provides a method to for the Linux kernel to obtain a pointer to the kernel data structures and save the user's `gsbase`. After this we will exit from the `error_entry` with the `ret` instruction. After the `error_entry` finished to execute, since we came from userspace we need to switch on kernel interrupt stack:
```assembly
movq %rsp,%rdi
call sync_regs
```
We just save all registers to the `error_entry` in the `error_entry`, we put address of the `pt_regs` to the `rdi` and call `sync_regs` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c):
This function switchs off the `IST` stack if we came from usermode. After this we switch on the stack which we got from the `sync_regs`:
```assembly
movq %rax,%rsp
movq %rsp,%rdi
```
and put pointer of the `pt_regs` again in the `rdi`, and in the last step we call an exception handler:
```assembly
call \do_sym
```
So, realy exceptions handlers are `do_debug` and `do_int3` functions. We will see these function in this part, but little later. First of all let's look on the preparations before a processor will transfer control to an interrupt handler. In another way if `paranoid` is set, but it is not 1, we call `paranoid_entry` which makes almost the same that `error_entry`, but it checks current mode with more slow but accurate way:
```assembly
ENTRY(paranoid_entry)
SAVE_C_REGS 8
SAVE_EXTRA_REGS 8
...
...
movl $MSR_GS_BASE,%ecx
rdmsr
testl %edx,%edx
js 1f /* negative -> in kernel */
SWAPGS
...
...
ret
END(paranoid_entry)
```
If `edx` wll be negative, we are in the kernel mode. As we store all registers on the stack, check that we are in the kernel mode, we need to setup `IST` stack if it is set for a given exception, call an exception handler and restore the exception stack:
```assembly
.if \shift_ist != -1
subq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
.endif
call \do_sym
.if \shift_ist != -1
addq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
.endif
```
The last step when an exception handler will finish it's work all registers will be restored from the stack with the `RESTORE_C_REGS` and `RESTORE_EXTRA_REGS` macros and control will be returned an interrupted task. That's all. Now we know about preparation before an interrupt/exception handler will start to execute and we can go directly to the implementation of the handlers.
Implementation of ainterrupts and exceptions handlers
Both handlers `do_debug` and `do_int3` defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c) source code file and have two similar things: All interrupts/exceptions handlers marked with the `dotraplinkage` prefix that expands to the:
which tells to compiler that something else uses this function (in our case these functions are called from the assembly interrupt preparation code). And also they takes two parameters:
* pointer to the `pt_regs` structure which contains registers of the interrupted task;
* error code.
First of all let's consider `do_debug` handler. This function starts from the getting previous state with the `ist_enter` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We call it because we need to know, did we come to the interrupt handler from the kernel mode or user mode.
```C
prev_state = ist_enter(regs);
```
The `ist_enter` function returns previous state context state and executes a couple preprartions before we continue to handle an exception. It starts from the check of the previous mode with the `user_mode_vm` macro. It takes `pt_regs` structure which contains a set of registers of the interrupted task and returns `1` if we came from userspace and `0` if we came from kernel space. According to the previous mode we execute `exception_enter` if we are from the userspace or inform [RCU](https://en.wikipedia.org/wiki/Read-copy-update) if we are from krenel space:
```C
...
if (user_mode_vm(regs)) {
prev_state = exception_enter();
} else {
rcu_nmi_enter();
prev_state = IN_KERNEL;
}
...
...
...
return prev_state;
```
After this we load the `DR6` debug registers to the `dr6` variable with the call of the `get_debugreg` macro from the [arch/x86/include/asm/debugreg.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/debugreg.h):
```C
get_debugreg(dr6, 6);
dr6 &= ~DR6_RESERVED;
```
The `DR6` debug register is debug status register contains information about the reason for stopping the `#DB` or debug exception handler. After we loaded its value to the `dr6` variable we filter out all reserved bits (`4:12` bits). In the next step we check `dr6` register and previous state with the following `if` condition expression:
```C
if (!dr6 && user_mode_vm(regs))
user_icebp = 1;
```
If `dr6` does not show any reasons why we caught this trap we set `user_icebp` to one which means that user-code wants to get [SIGTRAP](https://en.wikipedia.org/wiki/Unix_signal#SIGTRAP) signal. In the next step we check was it [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt) trap and if yes we go to exit:
```C
if ((dr6 & DR_STEP) && kmemcheck_trap(regs))
goto exit;
```
After we did all these checks, we clear the `dr6` register, clear the `DEBUGCTLMSR_BTF` flag which provides single-step on branches debugging, set `dr6` register for the current thread and increase `debug_stack_usage` [per-cpu]([Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)) variable with the:
more about `local_irq_enabled` and related stuff you can read in the second part about [interrupts handling in the Linux kernel](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-2.html). In the next step we check the previous mode was [virtual 8086](https://en.wikipedia.org/wiki/Virtual_8086_mode) and handle the trap:
If we came not from the virtual 8086 mode, we need to check `dr6` register and previous mode as we did it above. Here we check if step mode debugging is
enabled and we are not from the user mode, we enabled step mode debugging in the `dr6` copy in the current thread, set `TIF_SINGLE_STEP` falg and re-enable [Trap flag](https://en.wikipedia.org/wiki/Trap_flag) for the user mode:
```C
if ((dr6 & DR_STEP) && !user_mode(regs)) {
tsk->thread.debugreg6 &= ~DR_STEP;
set_tsk_thread_flag(tsk, TIF_SINGLESTEP);
regs->flags &= ~X86_EFLAGS_TF;
}
```
Then we get `SIGTRAP` signal code:
```C
si_code = get_si_code(tsk->thread.debugreg6);
```
and send it for user icebp traps:
```C
if (tsk->thread.debugreg6 & (DR_STEP | DR_TRAP_BITS) || user_icebp)
send_sigtrap(tsk, regs, error_code, si_code);
preempt_conditional_cli(regs);
debug_stack_usage_dec();
exit:
ist_exit(regs, prev_state);
```
In the end we disabled `irqs`, decrement value of the `debug_stack_usage` and exit from the exception handler with the `ist_exit` function.
The second exception handler is `do_int3` defined in the same source code file - [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). In the `do_int3` we makes almost the same that in the `do_debug` handler. We get the previous state with the `ist_enter`, increment and decrement the `debug_stack_usage` per-cpu variable, enabled and disable local interrupts. But of course there is one difference between these two handlers. We need to lock and than sync processor cores during breakpoint patching.
It is the end of the third part about interrupts and interrupt handling in the Linux kernel. We saw the initialization of the [Interrupt descriptor table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) in the previous part with the `#DB` and `#BP` gates and started to dive into preparation before control will be transfered to an exception handler and implementation of some interrupt handlers in this part. In the next part we will continue to dive into this theme and will go next by the `setup_arch` function and will try to understand interrupts handling related stuff.
If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**