mirror of
https://github.com/0xAX/linux-insides.git
synced 2024-12-22 22:58:08 +00:00
Interrupts and Interrupt Handling. Part 2.
================================================================================

Start to dive into interrupt and exceptions handling in the Linux kernel
--------------------------------------------------------------------------------

We saw some theory about interrupt and exception handling in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html) and, as I wrote there, in this part we will start to dive into interrupts and exceptions in the Linux kernel source code. As in the other chapters, we will start from the very early places. We will not go back to the earliest [code lines](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S#L292) as we did, for example, in the [Linux kernel booting process](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html) chapter; instead we will start from the earliest code that is related to interrupts and exceptions. In this part we will try to go through all the interrupt and exception related code that we can find in the Linux kernel source code.

If you've read the previous parts, you may remember that the earliest place in the Linux kernel `x86_64` architecture-specific source code which is related to interrupts is located in the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/pm.c) source code file and represents the first setup of the [Interrupt Descriptor Table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table). It occurs right before the transition into [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in the `go_to_protected_mode` function, by the call of `setup_idt`:

```C
void go_to_protected_mode(void)
{
    ...
    setup_idt();
    ...
}
```

The `setup_idt` function is defined in the same source code file as the `go_to_protected_mode` function and just loads the address of a `NULL` interrupt descriptor table:

```C
static void setup_idt(void)
{
    static const struct gdt_ptr null_idt = {0, 0};
    asm volatile("lidtl %0" : : "m" (null_idt));
}
```

where `gdt_ptr` describes the 48-bit value that is loaded into the `GDTR` register: a 16-bit limit followed by the 32-bit base address of the `Global Descriptor Table`:

```C
struct gdt_ptr {
    u16 len;
    u32 ptr;
} __attribute__((packed));
```

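The `packed` attribute is what makes the structure exactly 48 bits wide. A minimal user-space sketch of this, assuming a GCC-compatible compiler: the fixed-width types stand in for the kernel's `u16`/`u32`, and `gdt_ptr_unpacked` and the three helper functions are hypothetical names introduced only for comparison.

```C
#include <stddef.h>
#include <stdint.h>

/* User-space copy of the structure above; uint16_t/uint32_t stand in
 * for the kernel's u16/u32 (an assumption of this sketch). */
struct gdt_ptr {
    uint16_t len;   /* table limit */
    uint32_t ptr;   /* 32-bit linear base address of the table */
} __attribute__((packed));

/* Same fields without the attribute, for comparison only (hypothetical). */
struct gdt_ptr_unpacked {
    uint16_t len;
    uint32_t ptr;
};

/* Size of the packed operand: 2 + 4 bytes, with no padding. */
size_t packed_size(void)
{
    return sizeof(struct gdt_ptr);
}

/* Without packing the compiler is free to insert padding after len. */
size_t unpacked_size(void)
{
    return sizeof(struct gdt_ptr_unpacked);
}

/* Offset of the base address inside the packed structure. */
size_t packed_ptr_offset(void)
{
    return offsetof(struct gdt_ptr, ptr);
}
```

On common ABIs the unpacked variant grows to eight bytes because `ptr` is aligned to four bytes, which would not match the 6-byte operand that `lidtl` reads.
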
Of course, in our case the `gdt_ptr` does not represent the `GDTR` register but the `IDTR` register, since we set up the `Interrupt Descriptor Table`. You will not find an `idt_ptr` structure: if it existed in the Linux kernel source code, it would be the same as `gdt_ptr` but with a different name, so there is no sense in having two identical structures which differ only by name. Note that we do not fill the `Interrupt Descriptor Table` with entries here, because it is too early to handle any interrupts or exceptions at this point. That's why we just load a `NULL` `IDT`.

After the setup of the [Interrupt descriptor table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table), the [Global Descriptor Table](http://en.wikipedia.org/wiki/GDT) and other stuff, we jump into [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/pmjump.S). You can read more about it in the [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) which describes the transition to protected mode.

We already know from the earliest parts that the entry to protected mode is located in `boot_params.hdr.code32_start`, and you can see that we pass the protected mode entry and `boot_params` to `protected_mode_jump` at the end of [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/pm.c):

```C
protected_mode_jump(boot_params.hdr.code32_start,
                    (u32)&boot_params + (ds() << 4));
```

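The second argument converts a real-mode pointer into a linear address: in real mode `&boot_params` is only an offset inside the data segment, so the segment value shifted left by four bits has to be added. A small sketch of that arithmetic follows; `real_mode_linear` is a hypothetical helper name, not a kernel function.

```C
#include <stdint.h>

/* Real-mode linear address: the 16-bit segment value shifted left by
 * four bits, plus the offset inside that segment. */
uint32_t real_mode_linear(uint16_t segment, uint32_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

For example, with an arbitrarily chosen segment of `0x1000` and offset of `0x4000`, the helper returns `0x14000`.
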
The `protected_mode_jump` function is defined in [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/pmjump.S) and receives these two parameters in the `ax` and `dx` registers, using one of the [8086](http://en.wikipedia.org/wiki/Intel_8086) calling [conventions](http://en.wikipedia.org/wiki/X86_calling_conventions#List_of_x86_calling_conventions):

```assembly
GLOBAL(protected_mode_jump)
    ...
    ...
    ...
    .byte   0x66, 0xea        # ljmpl opcode
2:  .long   in_pm32           # offset
    .word   __BOOT_CS         # segment
    ...
    ...
    ...
ENDPROC(protected_mode_jump)
```

where `in_pm32` contains a jump to the 32-bit entry point:

```assembly
GLOBAL(in_pm32)
    ...
    ...
    jmpl    *%eax // %eax contains address of the `startup_32`
    ...
    ...
ENDPROC(in_pm32)
```

As you may remember, the 32-bit entry point is in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S) assembly file, although it contains `_64` in its name. We can see two similar files in the `arch/x86/boot/compressed` directory:

* `arch/x86/boot/compressed/head_32.S`;
* `arch/x86/boot/compressed/head_64.S`.

But the 32-bit mode entry point is in the second file in our case. The first file is not even compiled for `x86_64`. Let's look at the [arch/x86/boot/compressed/Makefile](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/Makefile):

```
vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
...
...
```

We can see here that `head_*` depends on the `$(BITS)` variable, which depends on the architecture. You can find it in the [arch/x86/Makefile](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/Makefile):

```
ifeq ($(CONFIG_X86_32),y)
...
        BITS := 32
else
        BITS := 64
...
endif
```

Now that we have jumped to `startup_32` from [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S), we will not find anything related to interrupt handling here. The `startup_32` code makes preparations before the transition into [long mode](http://en.wikipedia.org/wiki/Long_mode) and jumps directly into it. The `long mode` entry is located in `startup_64`, which makes preparations before the [kernel decompression](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) that occurs in `decompress_kernel` from [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c). After the kernel is decompressed, we jump to `startup_64` from [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head_64.S). In `startup_64` we start to build identity-mapped pages. After we have built the identity-mapped pages, checked the [NX](http://en.wikipedia.org/wiki/NX_bit) bit, set up the `Extended Feature Enable Register` (see the links), and updated the early `Global Descriptor Table` with the `lgdt` instruction, we need to set up the `gs` register with the following code:

```assembly
movl    $MSR_GS_BASE,%ecx
movl    initial_gs(%rip),%eax
movl    initial_gs+4(%rip),%edx
wrmsr
```

We already saw this code in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html). First of all, pay attention to the last instruction, `wrmsr`. This instruction writes data from the `edx:eax` registers to the [model specific register](http://en.wikipedia.org/wiki/Model-specific_register) specified by the `ecx` register. We can see that `ecx` contains `$MSR_GS_BASE`, which is declared in [arch/x86/include/uapi/asm/msr-index.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/uapi/asm/msr-index.h) and looks like:

```C
#define MSR_GS_BASE 0xc0000101
```

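The way the two `movl` instructions split one 64-bit value into the `edx:eax` pair can be sketched in user space. This is only a model: `struct wrmsr_operands`, `prepare_wrmsr` and `msr_value` are hypothetical names, and nothing here touches a real MSR.

```C
#include <stdint.h>

#define MSR_GS_BASE 0xc0000101

/* The three register operands that wrmsr consumes (hypothetical container). */
struct wrmsr_operands {
    uint32_t ecx;   /* number of the model specific register */
    uint32_t eax;   /* low 32 bits of the value */
    uint32_t edx;   /* high 32 bits of the value */
};

/* Split a 64-bit value the way the two movl instructions do:
 * the bytes at initial_gs go to eax, the bytes at initial_gs+4 to edx. */
struct wrmsr_operands prepare_wrmsr(uint32_t msr, uint64_t value)
{
    struct wrmsr_operands ops;

    ops.ecx = msr;
    ops.eax = (uint32_t)(value & 0xffffffffu);
    ops.edx = (uint32_t)(value >> 32);
    return ops;
}

/* Reassemble the 64-bit value that the processor would store in the MSR. */
uint64_t msr_value(struct wrmsr_operands ops)
{
    return ((uint64_t)ops.edx << 32) | ops.eax;
}
```

Reading `initial_gs` and `initial_gs+4` separately yields the low and the high half because `x86` is little-endian.
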
From this we can understand that `MSR_GS_BASE` defines the number of the `model specific register`. In 64-bit mode the base addresses of the `cs`, `ds`, `es`, and `ss` segment registers are ignored and treated as zero, but we can still access memory through the `fs` and `gs` registers. The model specific register provides a `back door` to the hidden parts of these segment registers and allows a 64-bit base address to be used for the segments addressed by `fs` and `gs`. So `MSR_GS_BASE` maps onto the hidden `GS.base` field. Let's look at `initial_gs`:

```assembly
GLOBAL(initial_gs)
    .quad   INIT_PER_CPU_VAR(irq_stack_union)
```

We pass the `irq_stack_union` symbol to the `INIT_PER_CPU_VAR` macro, which just concatenates the `init_per_cpu__` prefix with the given symbol. In our case we will get the `init_per_cpu__irq_stack_union` symbol. Let's look at the [linker](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/vmlinux.lds.S) script. There we can see the following definition:

```
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
INIT_PER_CPU(irq_stack_union);
```

It tells us that the address of `init_per_cpu__irq_stack_union` will be `irq_stack_union + __per_cpu_load`. Now we need to understand what `irq_stack_union` and `__per_cpu_load` are and what they mean. The first, `irq_stack_union`, is declared in [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/processor.h) with the `DECLARE_INIT_PER_CPU` macro, which expands to a use of the `init_per_cpu_var` macro:

```C
DECLARE_INIT_PER_CPU(irq_stack_union);

#define DECLARE_INIT_PER_CPU(var) \
       extern typeof(per_cpu_var(var)) init_per_cpu_var(var)

#define init_per_cpu_var(var)  init_per_cpu__##var
```

If we expand all the macros we will get the same `init_per_cpu__irq_stack_union` as we got after expanding the `INIT_PER_CPU` macro, but note that it is not just a symbol, but a variable. Let's look at the `typeof(per_cpu_var(var))` expression. Our `var` is `irq_stack_union`, and the `per_cpu_var` macro is defined in [arch/x86/include/asm/percpu.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/percpu.h):

```C
#define PER_CPU_VAR(var)        %__percpu_seg:var
```

where:

```C
#ifdef CONFIG_X86_64
    #define __percpu_seg gs
#endif
```

So, we are accessing `gs:irq_stack_union` and getting its type, which is `union irq_stack_union`. Ok, we have defined the first variable and know its address; now let's look at the second symbol, `__per_cpu_load`. There are a couple of `per-cpu` variables which are located after this symbol. The `__per_cpu_load` symbol is declared in [include/asm-generic/sections.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/asm-generic/sections.h):

```C
extern char __per_cpu_load[], __per_cpu_start[], __per_cpu_end[];
```

and represents the base address of the `per-cpu` variables in the data area. So, we know the addresses of `irq_stack_union` and `__per_cpu_load`, and we know that `init_per_cpu__irq_stack_union` must be located at `irq_stack_union + __per_cpu_load`. In the [System.map](http://en.wikipedia.org/wiki/System.map) it has the same address as `__per_cpu_load`:

```
...
...
...
ffffffff819ed000 D __init_begin
ffffffff819ed000 D __per_cpu_load
ffffffff819ed000 A init_per_cpu__irq_stack_union
...
...
...
```

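The symbol arithmetic above can be mimicked in user space. The sketch below is only an analogy: it takes the load address from the System.map excerpt, assumes that `irq_stack_union` sits at offset `0` of the per-cpu area (which is what the equal addresses suggest), and defines a function where the linker script defines a symbol. The names `per_cpu_load` and `irq_stack_union_off` are hypothetical.

```C
#include <stdint.h>

/* Assumed values: the load address is copied from the System.map excerpt,
 * the zero offset of irq_stack_union is an assumption of this sketch. */
static const uint64_t per_cpu_load = 0xffffffff819ed000ULL;  /* plays __per_cpu_load */
static const uint64_t irq_stack_union_off = 0;                /* plays irq_stack_union */

/* Same token pasting as the kernel's init_per_cpu_var macro. */
#define init_per_cpu_var(var) init_per_cpu__##var

/* C stand-in for the linker-script assignment
 *     init_per_cpu__x = x + __per_cpu_load
 * It defines a function instead of a symbol, because plain C cannot
 * assign symbol addresses. */
#define INIT_PER_CPU(x) \
    uint64_t init_per_cpu_var(x)(void) { return x##_off + per_cpu_load; }

/* Expands to a function named init_per_cpu__irq_stack_union. */
INIT_PER_CPU(irq_stack_union)
```

Calling `init_per_cpu__irq_stack_union()` returns the same address that the System.map lists for both `__per_cpu_load` and `init_per_cpu__irq_stack_union`.
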
Now we know about `initial_gs`, so let's look at the code:

```assembly
movl    $MSR_GS_BASE,%ecx
movl    initial_gs(%rip),%eax
movl    initial_gs+4(%rip),%edx
wrmsr
```

Here we select the model specific register with `MSR_GS_BASE`, put the 64-bit value stored at `initial_gs` into the `edx:eax` pair, and execute the `wrmsr` instruction to fill the `gs` base with the address of `init_per_cpu__irq_stack_union`, which will be the bottom of the interrupt stack. After this we jump to the C code, to `x86_64_start_kernel` from [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head64.c). In the `x86_64_start_kernel` function we do the last preparations before we jump into the generic and architecture-independent kernel code, and one of these preparations is filling the early `Interrupt Descriptor Table` with the interrupt handler entries, or `early_idt_handlers`. You may remember it if you have read the part about [Early interrupt and exception handling](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html), together with the following code:

```C
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
    set_intr_gate(i, early_idt_handlers[i]);

load_idt((const struct desc_ptr *)&idt_descr);
```

but I wrote the `Early interrupt and exception handling` part when the Linux kernel version was `3.18`. At the time of writing, the current version of the Linux kernel is `4.1.0-rc6+`, and `Andy Lutomirski` sent a [patch](https://lkml.org/lkml/2015/6/2/106) that changes the behaviour of the `early_idt_handlers` and will soon be in the mainline kernel. **NOTE**: while I was writing this part, the [patch](https://github.com/torvalds/linux/commit/425be5679fd292a3c36cb1fe423086708a99f11a) was already merged into the Linux kernel source code. Let's look at it. Now the same part looks like:

```C
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
    set_intr_gate(i, early_idt_handler_array[i]);

load_idt((const struct desc_ptr *)&idt_descr);
```

As you can see, the only difference is the name of the array of interrupt handler entry points. Now it is `early_idt_handler_array`:

```C
extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE];
```

where `NUM_EXCEPTION_VECTORS` and `EARLY_IDT_HANDLER_SIZE` are defined as:

```C
#define NUM_EXCEPTION_VECTORS 32
#define EARLY_IDT_HANDLER_SIZE 9
```

So, `early_idt_handler_array` is an array of interrupt handler entry points and contains one entry point every nine bytes. You may remember that the previous `early_idt_handlers` was defined in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head_64.S). The `early_idt_handler_array` is defined in the same source code file too:

```assembly
ENTRY(early_idt_handler_array)
...
...
...
ENDPROC(early_idt_handler_array)
```

It fills `early_idt_handler_array` using `.rept NUM_EXCEPTION_VECTORS` and contains the entry of the `early_make_pgtable` interrupt handler (you can read more about its implementation in the part about [Early interrupt and exception handling](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html)). For now we have come to the end of the `x86_64` architecture-specific code, and the next part is the generic kernel code. Of course, you may already know that we will return to the architecture-specific code in the `setup_arch` function and other places, but this is the end of the `x86_64` early code.

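How a two-dimensional `char` array gives one entry point every nine bytes can be checked with a small user-space sketch. The array below is an ordinary zero-filled buffer standing in for the real stub code, and `early_stub_offset` is a hypothetical helper.

```C
#include <stddef.h>

#define NUM_EXCEPTION_VECTORS 32
#define EARLY_IDT_HANDLER_SIZE 9

/* User-space stand-in: in the kernel this array holds machine code,
 * here it is only a zero-filled buffer with the same shape. */
const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE] = {{0}};

/* Byte distance between the start of the array and the stub for `vector`. */
size_t early_stub_offset(unsigned int vector)
{
    return (size_t)((const char *)early_idt_handler_array[vector] -
                    (const char *)early_idt_handler_array);
}
```

Indexing with the vector number therefore yields the address of the stub for that vector, which is exactly what `set_intr_gate(i, early_idt_handler_array[i])` relies on.
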
Setting stack canary for the interrupt stack
-------------------------------------------------------------------------------

The next stop after [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head_64.S) is the big `start_kernel` function from [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c). If you've read the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) about the Linux kernel initialization process, you must remember it. This function does all the initialization work before the kernel launches the first `init` process with [pid](https://en.wikipedia.org/wiki/Process_identifier) `1`. The first thing related to interrupt and exception handling is the call of the `boot_init_stack_canary` function.

This function sets the [canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries) value to protect the interrupt stack against overflow. We already saw some details of the implementation of `boot_init_stack_canary` in the previous part, and now let's take a closer look at it. You can find the implementation of this function in [arch/x86/include/asm/stackprotector.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/stackprotector.h); it depends on the `CONFIG_CC_STACKPROTECTOR` kernel configuration option. If this option is not set, the function does nothing:

```C
#ifdef CONFIG_CC_STACKPROTECTOR
...
...
...
#else
static inline void boot_init_stack_canary(void)
{
}
#endif
```

If the `CONFIG_CC_STACKPROTECTOR` kernel configuration option is set, the `boot_init_stack_canary` function starts with a check that the `stack_canary` field lies at an offset of forty bytes inside `irq_stack_union`, which represents the [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) interrupt stack:

```C
#ifdef CONFIG_X86_64
    BUILD_BUG_ON(offsetof(union irq_stack_union, stack_canary) != 40);
#endif
```

As we read in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html), `irq_stack_union` is represented by the following union:

```C
union irq_stack_union {
    char irq_stack[IRQ_STACK_SIZE];

    struct {
        char gs_base[40];
        unsigned long stack_canary;
    };
};
```

which is defined in [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/processor.h). We know that a [union](http://en.wikipedia.org/wiki/Union_type) in the [C](http://en.wikipedia.org/wiki/C_%28programming_language%29) programming language is a data structure whose members share the same memory, so only one of them holds a value at a time. We can see here that the structure's first field, `gs_base`, is 40 bytes in size and represents the bottom of the `irq_stack`. So the check with the `BUILD_BUG_ON` macro should pass. (You can read the first part about the Linux kernel initialization [process](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) if you're interested in the `BUILD_BUG_ON` macro.)

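The offset check can be reproduced in user space with a C11 `_Static_assert`, which plays a role similar to `BUILD_BUG_ON` here. In this sketch `IRQ_STACK_SIZE` is assumed to be 16 KiB purely for illustration, and `canary_offset` and `union_size` are hypothetical helpers.

```C
#include <stddef.h>

#define IRQ_STACK_SIZE 16384   /* assumption of this sketch: 16 KiB */

/* User-space copy of the union shown above. */
union irq_stack_union {
    char irq_stack[IRQ_STACK_SIZE];

    struct {
        char gs_base[40];
        unsigned long stack_canary;
    };
};

/* Compile-time check in the spirit of BUILD_BUG_ON: the build fails
 * if stack_canary does not sit 40 bytes above the bottom of the stack. */
_Static_assert(offsetof(union irq_stack_union, stack_canary) == 40,
               "stack_canary must be at offset 40");

/* Offset of the canary inside the union. */
size_t canary_offset(void)
{
    return offsetof(union irq_stack_union, stack_canary);
}

/* The union is as large as its largest member, the stack itself. */
size_t union_size(void)
{
    return sizeof(union irq_stack_union);
}
```

Because both members start at the same address, the canary occupies bytes of `irq_stack` itself rather than extra memory next to it.
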
After this we calculate a new `canary` value based on a random number and the [Time Stamp Counter](http://en.wikipedia.org/wiki/Time_Stamp_Counter):

```C
get_random_bytes(&canary, sizeof(canary));
tsc = __native_read_tsc();
canary += tsc + (tsc << 32UL);
```

and write the `canary` value to `irq_stack_union` with the `this_cpu_write` macro:

```C
this_cpu_write(irq_stack_union.stack_canary, canary);
```

You can read more about the `this_cpu_*` operations in the [Linux kernel documentation](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/this_cpu_ops.txt).

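The mixing step of the canary calculation is plain 64-bit arithmetic and can be tried outside the kernel. In this sketch the random seed and the time stamp are passed in as parameters so that the result is reproducible; `mix_canary` is a hypothetical helper, not a kernel function.

```C
#include <stdint.h>

/* Mix a random seed with a time stamp the same way as the snippet above:
 * canary += tsc + (tsc << 32). Unsigned arithmetic wraps on overflow. */
uint64_t mix_canary(uint64_t random_seed, uint64_t tsc)
{
    return random_seed + tsc + (tsc << 32);
}
```

Adding `tsc << 32` copies the fast-changing low bits of the counter into the upper half of the canary as well, so both halves of the value are perturbed.
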
Disabling/Enabling local interrupts
--------------------------------------------------------------------------------

The next step in [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) which is related to interrupts and interrupt handling, after we have set the `canary` value in the interrupt stack, is the call of the `local_irq_disable` macro.

This macro is defined in the [include/linux/irqflags.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/irqflags.h) header file and, as you can understand, it disables interrupts on the current CPU. Let's look at its implementation. First of all, note that it depends on the `CONFIG_TRACE_IRQFLAGS_SUPPORT` kernel configuration option:

```C
#ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
...
#define local_irq_disable() \
         do { raw_local_irq_disable(); trace_hardirqs_off(); } while (0)
...
#else
...
#define local_irq_disable()     do { raw_local_irq_disable(); } while (0)
...
#endif
```

They are similar and, as you can see, have only one difference: the `local_irq_disable` macro contains a call to `trace_hardirqs_off` when `CONFIG_TRACE_IRQFLAGS_SUPPORT` is enabled. There is a special feature in the [lockdep](http://lwn.net/Articles/321663/) subsystem - `irq-flags tracing` - for tracing `hardirq` and `softirq` state. In our case the `lockdep` subsystem can give us interesting information about the hard/soft irq on/off events which occur in the system. The `trace_hardirqs_off` function is defined in [kernel/locking/lockdep.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/locking/lockdep.c):

```C
void trace_hardirqs_off(void)
{
    trace_hardirqs_off_caller(CALLER_ADDR0);
}
EXPORT_SYMBOL(trace_hardirqs_off);
```

and just calls the `trace_hardirqs_off_caller` function. The `trace_hardirqs_off_caller` checks the `hardirqs_enabled` field of the current process and increases `redundant_hardirqs_off` if the call of `local_irq_disable` was redundant, or `hardirqs_off_events` if it was not. These two fields and the other `lockdep` statistics fields are defined in [kernel/locking/lockdep_internals.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/locking/lockdep_internals.h) and are located in the `lockdep_stats` structure:

```C
struct lockdep_stats {
    ...
    ...
    ...
    int hardirqs_off_events;
    ...
    int redundant_hardirqs_off;
    ...
    ...
    ...
};
```

If you set the `CONFIG_DEBUG_LOCKDEP` kernel configuration option, the `lockdep_stats_debug_show` function will write all the tracing information to `/proc/lockdep_stats`:

```C
static void lockdep_stats_debug_show(struct seq_file *m)
{
#ifdef CONFIG_DEBUG_LOCKDEP
    unsigned long long hi1 = debug_atomic_read(hardirqs_on_events),
                       hi2 = debug_atomic_read(hardirqs_off_events),
                       hr1 = debug_atomic_read(redundant_hardirqs_on),
    ...
    ...
    ...
    seq_printf(m, " hardirq on events: %11llu\n", hi1);
    seq_printf(m, " hardirq off events: %11llu\n", hi2);
    seq_printf(m, " redundant hardirq ons: %11llu\n", hr1);
#endif
}
```

and you can see its result with:

```
$ sudo cat /proc/lockdep_stats
 hardirq on events: 12838248974
 hardirq off events: 12838248979
 redundant hardirq ons: 67792
 redundant hardirq offs: 3836339146
 softirq on events: 38002159
 softirq off events: 38002187
 redundant softirq ons: 0
 redundant softirq offs: 0
```

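The bookkeeping behind the `hardirq off events` and `redundant hardirq offs` counters can be modelled in a few lines. This is a hypothetical user-space model that follows the description above, not the kernel's implementation: `struct irq_trace_model`, `model_trace_hardirqs_off` and `count_redundant_offs` are invented names, and only the field names are borrowed from the text.

```C
#include <stdbool.h>

/* Hypothetical model of the per-task state and the two statistics. */
struct irq_trace_model {
    bool hardirqs_enabled;
    unsigned long hardirqs_off_events;
    unsigned long redundant_hardirqs_off;
};

/* Account one "interrupts off" request: it counts as a real event when
 * interrupts were enabled before, and as a redundant one otherwise. */
void model_trace_hardirqs_off(struct irq_trace_model *m)
{
    if (m->hardirqs_enabled) {
        m->hardirqs_enabled = false;
        m->hardirqs_off_events++;
    } else {
        m->redundant_hardirqs_off++;
    }
}

/* Start with interrupts enabled, disable them `calls` times in a row,
 * and report how many of those requests were redundant. */
unsigned long count_redundant_offs(unsigned int calls)
{
    struct irq_trace_model m = { .hardirqs_enabled = true };

    for (unsigned int i = 0; i < calls; i++)
        model_trace_hardirqs_off(&m);
    return m.redundant_hardirqs_off;
}
```

Three back-to-back disable requests give one real event and two redundant ones, which is the kind of imbalance the large `redundant hardirq offs` number in the output reflects.
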
Ok, now we know a little about tracing, but more info will be in a separate part about `lockdep` and `tracing`. You can see that both `local_irq_disable` macros share the same part - `raw_local_irq_disable`. This macro is defined in [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/irqflags.h) and expands to the call of:

```C
static inline void native_irq_disable(void)
{
    asm volatile("cli": : :"memory");
}
```

You probably remember that the `cli` instruction clears the [IF](http://en.wikipedia.org/wiki/Interrupt_flag) flag, which determines whether the processor responds to maskable hardware interrupts (exceptions and non-maskable interrupts are not affected by it). Besides `local_irq_disable`, there is, as you may already know, an inverse macro - `local_irq_enable`. This macro has the same tracing mechanism and is very similar to `local_irq_disable`, but, as you can understand from its name, it enables interrupts with the `sti` instruction:

```C
static inline void native_irq_enable(void)
{
    asm volatile("sti": : :"memory");
}
```

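Since `cli` and `sti` are privileged instructions, their effect is easiest to illustrate on a plain integer that models the flags register. In the sketch below `IF` is bit 9 of the flags; the `model_*` functions are hypothetical helpers that only manipulate a value and never execute the real instructions.

```C
#include <stdbool.h>
#include <stdint.h>

#define X86_EFLAGS_IF ((uint64_t)1 << 9)   /* interrupt enable flag, bit 9 */

/* What cli does to the flags: clear IF, leave every other bit alone. */
uint64_t model_cli(uint64_t rflags)
{
    return rflags & ~X86_EFLAGS_IF;
}

/* What sti does to the flags: set IF, leave every other bit alone. */
uint64_t model_sti(uint64_t rflags)
{
    return rflags | X86_EFLAGS_IF;
}

/* Maskable interrupts are disabled exactly when IF is clear. */
bool model_irqs_disabled(uint64_t rflags)
{
    return !(rflags & X86_EFLAGS_IF);
}
```

A check such as `irqs_disabled()` boils down to the same test of this single bit in the saved flags.
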
Now we know how `local_irq_disable` and `local_irq_enable` work. This was the first call of the `local_irq_disable` macro, but we will meet these macros many times in the Linux kernel source code. For now we are in the `start_kernel` function from [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) and we have just disabled `local` interrupts. Why local, and why did we do it? Previously the kernel provided a method to disable interrupts on all processors, and it was called `cli`. This function was [removed](https://lwn.net/Articles/291956/), and now we have `local_irq_{enable,disable}` to enable or disable interrupts on the current processor. After we've disabled interrupts with the `local_irq_disable` macro, we set:

```C
early_boot_irqs_disabled = true;
```

The `early_boot_irqs_disabled` variable is declared in [include/linux/kernel.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/kernel.h):

```C
extern bool early_boot_irqs_disabled;
```

and is used in different places. For example, it is used in the `smp_call_function_many` function from [kernel/smp.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/smp.c) to check for a possible deadlock when interrupts are disabled:

```C
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
             && !oops_in_progress && !early_boot_irqs_disabled);
```

Early trap initialization during kernel initialization
--------------------------------------------------------------------------------

The next functions after `local_irq_disable` are `boot_cpu_init` and `page_address_init`, but they are not related to interrupts and exceptions (you can read more about these functions in the chapter about the Linux kernel [initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html)). The next is the `setup_arch` function. As you may remember, this function is located in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c) source code file and initializes many different architecture-dependent [things](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html). The first interrupt related function which we can see in `setup_arch` is the `early_trap_init` function. This function is defined in [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/traps.c) and fills the `Interrupt Descriptor Table` with a couple of entries:

```C
void __init early_trap_init(void)
{
    set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
    set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
#ifdef CONFIG_X86_32
    set_intr_gate(X86_TRAP_PF, page_fault);
#endif
    load_idt(&idt_descr);
}
```

Here we can see calls of three different functions:

* `set_intr_gate_ist`
* `set_system_intr_gate_ist`
* `set_intr_gate`

All of these functions are defined in [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/desc.h) and do similar, but not identical, things. The first, `set_intr_gate_ist`, inserts a new interrupt gate into the `IDT`. Let's look at its implementation:

```C
static inline void set_intr_gate_ist(int n, void *addr, unsigned ist)
{
    BUG_ON((unsigned)n > 0xFF);
    _set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS);
}
```

First of all we can see the check that `n`, which is the [vector number](http://en.wikipedia.org/wiki/Interrupt_vector_table) of the interrupt, is not greater than `0xff` or 255. We need this check because, as we remember from the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html), the vector number of an interrupt must be between `0` and `255`. In the next step we can see the call of the `_set_gate` function, which writes the given interrupt gate to the `IDT` table:

```C
static inline void _set_gate(int gate, unsigned type, void *addr,
                             unsigned dpl, unsigned ist, unsigned seg)
{
    gate_desc s;

    pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
    write_idt_entry(idt_table, gate, &s);
    write_trace_idt_entry(gate, &s);
}
```

Here we start with the `pack_gate` function, which takes a clean `IDT` entry represented by the `gate_desc` structure and fills it with the address of the handler (split into three offset fields), the code segment selector, the [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks) index, the [Privilege level](http://en.wikipedia.org/wiki/Privilege_level), and the type of the gate, which can be one of the following values:

* `GATE_INTERRUPT`
* `GATE_TRAP`
* `GATE_CALL`
* `GATE_TASK`

and sets the present bit for the given `IDT` entry:

```C
static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
                             unsigned dpl, unsigned ist, unsigned seg)
{
    gate->offset_low    = PTR_LOW(func);
    gate->segment       = __KERNEL_CS;
    gate->ist           = ist;
    gate->p             = 1;
    gate->dpl           = dpl;
    gate->zero0         = 0;
    gate->zero1         = 0;
    gate->type          = type;
    gate->offset_middle = PTR_MIDDLE(func);
    gate->offset_high   = PTR_HIGH(func);
}
```

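The splitting of the 64-bit handler address into three offset fields can be tried in user space. The sketch below uses a simplified `struct gate_model` that has the same fields but deliberately not the hardware bit layout; the `PTR_*` macros are written with the meaning their names suggest, and `MODEL_KERNEL_CS` is an assumed selector value used only for illustration.

```C
#include <stdint.h>

/* Helpers with the meaning suggested by the names used in pack_gate. */
#define PTR_LOW(x)    ((uint64_t)(x) & 0xFFFF)
#define PTR_MIDDLE(x) (((uint64_t)(x) >> 16) & 0xFFFF)
#define PTR_HIGH(x)   ((uint64_t)(x) >> 32)

#define MODEL_GATE_INTERRUPT 0xE    /* type value of a 64-bit interrupt gate */
#define MODEL_KERNEL_CS      0x10   /* assumed selector, illustration only */

/* Simplified model of a gate: same fields, NOT the hardware bit layout. */
struct gate_model {
    uint16_t offset_low;
    uint16_t segment;
    uint8_t  ist;
    uint8_t  type;
    uint8_t  dpl;
    uint8_t  p;
    uint16_t offset_middle;
    uint32_t offset_high;
};

/* Fill the model the same way pack_gate fills a gate_desc. */
struct gate_model model_pack_gate(unsigned type, uint64_t func,
                                  unsigned dpl, unsigned ist)
{
    struct gate_model g;

    g.offset_low    = (uint16_t)PTR_LOW(func);
    g.segment       = MODEL_KERNEL_CS;
    g.ist           = (uint8_t)ist;
    g.type          = (uint8_t)type;
    g.dpl           = (uint8_t)dpl;
    g.p             = 1;
    g.offset_middle = (uint16_t)PTR_MIDDLE(func);
    g.offset_high   = (uint32_t)PTR_HIGH(func);
    return g;
}

/* Reassemble the handler address from the three offset fields. */
uint64_t model_gate_offset(struct gate_model g)
{
    return ((uint64_t)g.offset_high << 32) |
           ((uint64_t)g.offset_middle << 16) |
           g.offset_low;
}
```

For a made-up handler address such as `0xffffffff81234567` the three fields become `0x4567`, `0x8123` and `0xffffffff`, and joining them again gives back the original address.
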
After this we write the just-filled interrupt gate to the `IDT` with the `write_idt_entry` macro, which expands to `native_write_idt_entry` and just copies the interrupt gate to the `idt_table` table at the given index:

```C
#define write_idt_entry(dt, entry, g)           native_write_idt_entry(dt, entry, g)

static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
{
    memcpy(&idt[entry], gate, sizeof(*gate));
}
```

where `idt_table` is just an array of `gate_desc`:

```C
extern gate_desc idt_table[];
```

That's all. The second function, `set_system_intr_gate_ist`, has only one difference from `set_intr_gate_ist`:

```C
static inline void set_system_intr_gate_ist(int n, void *addr, unsigned ist)
{
    BUG_ON((unsigned)n > 0xFF);
    _set_gate(n, GATE_INTERRUPT, addr, 0x3, ist, __KERNEL_CS);
}
```

Do you see it? Look at the fourth parameter of `_set_gate`. It is `0x3`. In `set_intr_gate_ist` it was `0x0`. We know that this parameter represents the `DPL`, or privilege level. We also know that `0` is the highest privilege level and `3` is the lowest. Now we know how `set_system_intr_gate_ist`, `set_intr_gate_ist`, and `set_intr_gate` work, and we can return to the `early_trap_init` function. Let's look at it again:

```C
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
```

We set two `IDT` entries, for the `#DB` exception and for `int3`. These functions take the same set of parameters:

* vector number of an interrupt;
* address of an interrupt handler;
* interrupt stack table index.

That's all. You will learn more about interrupts and handlers in the next parts.

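Why the `int3` gate gets a `DPL` of `3` while the `#DB` gate keeps `0` can be illustrated with the privilege check that the processor applies to software-generated interrupts: an `int n` or `int3` instruction may use a gate only if the current privilege level is numerically less than or equal to the gate's `DPL`. A tiny sketch of that rule follows; `software_int_allowed` is a hypothetical helper name.

```C
#include <stdbool.h>

/* A software interrupt (int n, int3) may go through a gate only when the
 * current privilege level is numerically <= the DPL of the gate. */
bool software_int_allowed(unsigned int cpl, unsigned int gate_dpl)
{
    return cpl <= gate_dpl;
}
```

With `DPL` equal to `3`, user-space code running at privilege level `3` can execute `int3` and reach the handler, which is what debuggers rely on for breakpoints; with `DPL` equal to `0` the same attempt is refused.
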
Conclusion
--------------------------------------------------------------------------------

This is the end of the second part about interrupts and interrupt handling in the Linux kernel. We saw some theory in the previous part and started to dive into interrupt and exception handling in the current part. We started from the earliest places in the Linux kernel source code which are related to interrupts. In the next part we will continue to dive into this interesting theme and will learn more about the interrupt handling process.

If you have any questions or suggestions, write me a comment or ping me on [twitter](https://twitter.com/0xAX).

**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [IDT](http://en.wikipedia.org/wiki/Interrupt_descriptor_table)
* [Protected mode](http://en.wikipedia.org/wiki/Protected_mode)
* [List of x86 calling conventions](http://en.wikipedia.org/wiki/X86_calling_conventions#List_of_x86_calling_conventions)
* [8086](http://en.wikipedia.org/wiki/Intel_8086)
* [Long mode](http://en.wikipedia.org/wiki/Long_mode)
* [NX](http://en.wikipedia.org/wiki/NX_bit)
* [Extended Feature Enable Register](http://en.wikipedia.org/wiki/Control_register#Additional_Control_registers_in_x86-64_series)
* [Model-specific register](http://en.wikipedia.org/wiki/Model-specific_register)
* [Process identifier](https://en.wikipedia.org/wiki/Process_identifier)
* [lockdep](http://lwn.net/Articles/321663/)
* [irqflags tracing](https://www.kernel.org/doc/Documentation/irqflags-tracing.txt)
* [IF](http://en.wikipedia.org/wiki/Interrupt_flag)
* [Stack canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries)
* [Union type](http://en.wikipedia.org/wiki/Union_type)
* [this_cpu_* operations](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/this_cpu_ops.txt)
* [vector number](http://en.wikipedia.org/wiki/Interrupt_vector_table)
* [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks)
* [Privilege level](http://en.wikipedia.org/wiki/Privilege_level)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html)