Interrupts and Interrupt Handling. Part 2.
Start to dive into interrupt and exceptions handling in the Linux kernel
We covered some theory about interrupt and exception handling in the previous part, and as I mentioned there, we will now start to dive into interrupts and exceptions in the Linux kernel source code. As in the other chapters, we will begin with the earliest stages. We will not walk through the kernel from its very first lines, as we did in the Linux kernel booting process chapter. Instead, we will start from the earliest code that is related to interrupts and exceptions.
If you've read the previous parts, you may remember that the earliest place in the Linux kernel x86_64 architecture-specific source code that is related to interrupts is located in the arch/x86/boot/pm.c source code file and represents the first setup of the Interrupt Descriptor Table. It occurs right before the transition into protected mode, in the go_to_protected_mode function, which calls setup_idt:
void go_to_protected_mode(void)
{
...
setup_idt();
...
}
The setup_idt function is defined in the same source code file as the go_to_protected_mode function and just loads the address of a NULL interrupt descriptor table:
static void setup_idt(void)
{
static const struct gdt_ptr null_idt = {0, 0};
asm volatile("lidtl %0" : : "m" (null_idt));
}
where gdt_ptr represents the contents of the special 48-bit GDTR register, which must contain the limit and the base address of the Global Descriptor Table:
struct gdt_ptr {
u16 len;
u32 ptr;
} __attribute__((packed));
Of course, in our case gdt_ptr does not describe the GDTR register but the IDTR register, since we are setting up the Interrupt Descriptor Table. You will not find an idt_ptr structure in the Linux kernel source code, because it would be identical to gdt_ptr except for its name, and it would make no sense to have two structures that differ only in their names. Note that we do not fill the Interrupt Descriptor Table with entries here, because it is too early to handle any interrupts or exceptions at this point. That's why we just load a NULL IDT.
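To make the 48-bit layout concrete, here is a minimal user-space sketch. The name desc_ptr_sketch and the helper make_null_idt are hypothetical and exist only for this illustration; the fields mirror the gdt_ptr structure shown above, and the packed attribute is assumed to be available (GCC or Clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the boot code's struct gdt_ptr. */
struct desc_ptr_sketch {
	uint16_t len;	/* table limit, 16 bits */
	uint32_t ptr;	/* linear base address, 32 bits */
} __attribute__((packed));

/* Build the all-zero descriptor, like the null_idt used by setup_idt(). */
static struct desc_ptr_sketch make_null_idt(void)
{
	struct desc_ptr_sketch d = { 0, 0 };

	/* 16 + 32 bits with no padding: the operand is 6 bytes wide. */
	assert(sizeof(d) == 6);
	assert(offsetof(struct desc_ptr_sketch, ptr) == 2);
	return d;
}
```

Without the packed attribute a compiler would normally insert two bytes of padding after len, and the structure would no longer match the 6-byte operand that lidtl reads.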
After setting up the Interrupt Descriptor Table, the Global Descriptor Table and a few other things, we jump into protected mode in the arch/x86/boot/pmjump.S file. You can read more about it in the part that describes the transition to protected mode.
The entry to protected mode is located in the boot_params.hdr.code32_start
and passed together with the boot_params
to the protected_mode_jump
function at the end of arch/x86/boot/pm.c:
protected_mode_jump(boot_params.hdr.code32_start,
(u32)&boot_params + (ds() << 4));
The protected_mode_jump
function is defined at arch/x86/boot/pmjump.S and receives these two parameters within the ax
and dx
registers, using one of the 8086 calling conventions:
SYM_FUNC_START_NOALIGN(protected_mode_jump)
...
...
...
.byte 0x66, 0xea # ljmpl opcode
2: .long .Lin_pm32 # offset
.word __BOOT_CS # segment
SYM_FUNC_END(protected_mode_jump)
where .Lin_pm32 contains a jump to the 32-bit entry point:
SYM_FUNC_START_LOCAL_NOALIGN(.Lin_pm32)
...
...
jmpl *%eax # Jump to the 32-bit entrypoint
SYM_FUNC_END(.Lin_pm32)
As you may remember, the 32-bit entry point is in the arch/x86/boot/compressed/head_64.S assembly file, although it contains _64 in its name. There are two similar files in the arch/x86/boot/compressed directory:
- arch/x86/boot/compressed/head_32.S
- arch/x86/boot/compressed/head_64.S
In our case the 32-bit mode entry point is in the second file. The first file is not even compiled for x86_64. Let's look at the arch/x86/boot/compressed/Makefile:
vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
...
...
We can see here that head_* depends on the $(BITS) variable, which in turn depends on the architecture. You can find it in the arch/x86/Makefile:
ifeq ($(CONFIG_X86_32),y)
...
BITS := 32
else
BITS := 64
...
endif
Now that we have jumped to startup_32 in arch/x86/boot/compressed/head_64.S, we will not find anything related to interrupt handling there. The startup_32 code makes preparations for the transition into long mode and then jumps directly into it. The long mode entry point is startup_64, which makes preparations for the kernel decompression that is performed by decompress_kernel from arch/x86/boot/compressed/misc.c. After the kernel is decompressed, we jump to startup_64 in arch/x86/kernel/head_64.S. In this startup_64 we start to build identity-mapped pages. After we have built the identity-mapped pages, checked the NX bit, set up the Extended Feature Enable Register (see links), and updated the early Global Descriptor Table with the lgdt instruction, we need to set up the gs register with the following code:
movl $MSR_GS_BASE,%ecx
movl initial_gs(%rip),%eax
movl initial_gs+4(%rip),%edx
wrmsr
We already saw this code in the previous part. First of all, pay attention to the last instruction, wrmsr. It writes data from the edx:eax registers to the model specific register specified by the ecx register. We can see that ecx contains $MSR_GS_BASE, which is declared in arch/x86/include/uapi/asm/msr-index.h and looks like:
#define MSR_GS_BASE 0xc0000101
From this we can understand that MSR_GS_BASE defines the number of the model specific register. Since the cs, ds, es, and ss registers are not used in 64-bit mode, their fields are ignored. But we can still access memory through the fs and gs registers. The model specific register provides a back door to the hidden parts of these segment registers and allows a 64-bit base address to be used for memory addressed through fs and gs. So MSR_GS_BASE is mapped to the hidden GS.base field. Let's look at initial_gs:
GLOBAL(initial_gs)
.quad INIT_PER_CPU_VAR(irq_stack_union)
We pass the irq_stack_union symbol to the INIT_PER_CPU_VAR macro, which just concatenates the init_per_cpu__ prefix with the given symbol. In our case we get the init_per_cpu__irq_stack_union symbol. Let's look at the linker script. There we can see the following definition:
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
INIT_PER_CPU(irq_stack_union);
It tells us that the address of init_per_cpu__irq_stack_union will be irq_stack_union + __per_cpu_load. Now we need to understand where init_per_cpu__irq_stack_union and __per_cpu_load are and what they mean. The first one is declared in arch/x86/include/asm/processor.h with the DECLARE_INIT_PER_CPU macro, which expands to a use of the init_per_cpu_var macro:
DECLARE_INIT_PER_CPU(irq_stack_union);
#define DECLARE_INIT_PER_CPU(var) \
extern typeof(per_cpu_var(var)) init_per_cpu_var(var)
#define init_per_cpu_var(var) init_per_cpu__##var
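The token pasting that init_per_cpu_var relies on can be tried in user space. The following is only a sketch: PASTE_PREFIX, STRINGIFY and pasted_name are hypothetical names introduced for this illustration and are not part of the kernel.

```c
#include <assert.h>
#include <string.h>

/* Same shape as init_per_cpu_var(): glue a prefix onto the argument. */
#define PASTE_PREFIX(var) init_per_cpu__##var

/* Two-level stringification so the argument is expanded before '#'. */
#define STRINGIFY_(x) #x
#define STRINGIFY(x) STRINGIFY_(x)

/* Returns the identifier produced by the pasting, as a string. */
static const char *pasted_name(void)
{
	return STRINGIFY(PASTE_PREFIX(irq_stack_union));
}
```

Calling pasted_name() yields the string "init_per_cpu__irq_stack_union", which is the single identifier the preprocessor builds from the prefix and the argument.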
If we expand all the macros we get the same init_per_cpu__irq_stack_union as we got after expanding the INIT_PER_CPU macro, but note that here it is not just a symbol but a variable. Let's look at the typeof(per_cpu_var(var)) expression. Our var is irq_stack_union, and the per_cpu_var macro is defined in arch/x86/include/asm/percpu.h:
#define PER_CPU_VAR(var) %__percpu_seg:var
where:
#ifdef CONFIG_X86_64
#define __percpu_seg gs
#endif
So, we are accessing gs:irq_stack_union and getting its type, which is union irq_stack_union. OK, we have defined the first variable and know its address, so now let's look at the second symbol, __per_cpu_load. There are a couple of per-cpu variables which are located after this symbol. __per_cpu_load is declared in include/asm-generic/sections.h:
extern char __per_cpu_load[], __per_cpu_start[], __per_cpu_end[];
and represents the base address of the per-cpu variables in the data area. So, we know the addresses of irq_stack_union and __per_cpu_load, and we know that init_per_cpu__irq_stack_union must be placed right after __per_cpu_load. We can see it in System.map:
...
...
...
ffffffff819ed000 D __init_begin
ffffffff819ed000 D __per_cpu_load
ffffffff819ed000 A init_per_cpu__irq_stack_union
...
...
...
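The arithmetic behind these addresses, and the way wrmsr expects the result, can be sketched in a few lines of C. This is an illustration under an assumption: irq_stack_union is taken to have a link-time value of 0, which is what the identical addresses in the System.map listing above suggest. The helper names and the msr_halves structure are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of the linker-script rule: init_per_cpu__x = x + __per_cpu_load. */
static uint64_t init_per_cpu_addr(uint64_t sym_value, uint64_t per_cpu_load)
{
	return sym_value + per_cpu_load;
}

/* wrmsr takes a 64-bit value as two halves: low in eax, high in edx. */
struct msr_halves {
	uint32_t eax;
	uint32_t edx;
};

static struct msr_halves split_for_wrmsr(uint64_t value)
{
	struct msr_halves h;

	h.eax = (uint32_t)(value & 0xffffffffu);
	h.edx = (uint32_t)(value >> 32);
	return h;
}
```

With the address from the listing, 0xffffffff819ed000, the low half is 0x819ed000 and the high half is 0xffffffff. These are the two 32-bit loads that the movl instructions perform from initial_gs and initial_gs+4.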
Now we know about initial_gs
, so let's look at the code:
movl $MSR_GS_BASE,%ecx
movl initial_gs(%rip),%eax
movl initial_gs+4(%rip),%edx
wrmsr
Here we specify a model specific register with MSR_GS_BASE, put the 64-bit address stored in initial_gs into the edx:eax pair and execute the wrmsr instruction to fill the gs base with the address of init_per_cpu__irq_stack_union, which will be at the bottom of the interrupt stack. After this we jump to the C code at x86_64_start_kernel in arch/x86/kernel/head64.c. In the x86_64_start_kernel function we make the last preparations before we jump into the generic, architecture-independent kernel code, and one of these preparations is filling the early Interrupt Descriptor Table with interrupt handler entries, the early_idt_handlers. You may remember it if you have read the part about Early interrupt and exception handling, which contained the following code:
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
set_intr_gate(i, early_idt_handlers[i]);
load_idt((const struct desc_ptr *)&idt_descr);
but I wrote the Early interrupt and exception handling part when the Linux kernel version was 3.18. Today the current version of the Linux kernel is 4.1.0-rc6+, and Andy Lutomirski has sent a patch that changes the behaviour of early_idt_handlers and will soon be in the mainline kernel. NOTE: while I was writing this part, the patch was merged into the Linux kernel source code. Let's look at it. Now the same part looks like:
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
set_intr_gate(i, early_idt_handler_array[i]);
load_idt((const struct desc_ptr *)&idt_descr);
As you can see, the only difference is the name of the array of interrupt handler entry points. Now it is early_idt_handler_array:
extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE];
where NUM_EXCEPTION_VECTORS
and EARLY_IDT_HANDLER_SIZE
are defined as:
#define NUM_EXCEPTION_VECTORS 32
#define EARLY_IDT_HANDLER_SIZE 9
So, early_idt_handler_array is an array of interrupt handler entry points and contains one entry point every nine bytes. You may remember that the previous early_idt_handlers was defined in arch/x86/kernel/head_64.S. The early_idt_handler_array is defined in the same source code file:
ENTRY(early_idt_handler_array)
...
...
...
ENDPROC(early_idt_handler_common)
It fills early_idt_handler_array with the .rept NUM_EXCEPTION_VECTORS directive and contains the entry of the early_make_pgtable interrupt handler (you can read more about its implementation in the part about Early interrupt and exception handling). We have now reached the end of the x86_64 architecture-specific code, and the next part is the generic kernel code. As you probably already know, we will return to the architecture-specific code in the setup_arch function and other places, but this is the end of the x86_64 early code.
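Since early_idt_handler_array is declared as a two-dimensional char array, the entry point for vector i is simply the start of row i. A minimal sketch of that stride arithmetic follows; handler_array_sketch and handler_offset are hypothetical stand-ins, and the real array holds assembly stubs rather than zero bytes.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_EXCEPTION_VECTORS	32
#define EARLY_IDT_HANDLER_SIZE	9

/* Stand-in for the assembly-generated stubs; the contents do not matter here. */
static const char handler_array_sketch[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE] = { { 0 } };

/* Byte offset of the entry point for vector i from the start of the array. */
static size_t handler_offset(unsigned int i)
{
	assert(i < NUM_EXCEPTION_VECTORS);
	return (size_t)i * sizeof(handler_array_sketch[0]);
}
```

Vector 1 therefore starts 9 bytes into the array, vector 31 starts at byte 279, and the whole array occupies 32 * 9 = 288 bytes. This is why the loop shown earlier can pass early_idt_handler_array[i] directly as the handler address.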
Setting stack canary for the interrupt stack
The next stop after arch/x86/kernel/head_64.S is the big start_kernel function from init/main.c. If you've read the previous chapter about the Linux kernel initialization process, you will remember it. This function does all the initialization before the kernel launches the first init process, with PID 1. The first thing related to interrupt and exception handling is the call to the boot_init_stack_canary function.
This function sets the canary value to protect against interrupt stack overflow. We already saw some details of the implementation of boot_init_stack_canary in the previous part, and now let's take a closer look at it. You can find the implementation of this function in arch/x86/include/asm/stackprotector.h, and it depends on the CONFIG_CC_STACKPROTECTOR kernel configuration option. If this option is not set, the function does nothing:
#ifdef CONFIG_CC_STACKPROTECTOR
...
...
...
#else
static inline void boot_init_stack_canary(void)
{
}
#endif
If the CONFIG_CC_STACKPROTECTOR kernel configuration option is set, the boot_init_stack_canary function starts with a check that the stack_canary field is located at an offset of forty bytes from the start of irq_stack_union, which represents the per-cpu interrupt stack:
#ifdef CONFIG_X86_64
BUILD_BUG_ON(offsetof(union irq_stack_union, stack_canary) != 40);
#endif
As we saw in the previous part, irq_stack_union is represented by the following union:
union irq_stack_union {
char irq_stack[IRQ_STACK_SIZE];
struct {
char gs_base[40];
unsigned long stack_canary;
};
};
which is defined in arch/x86/include/asm/processor.h. We know that a union in the C programming language is a data structure whose members share the same memory, so only one of them holds a value at a time. We can see here that the structure's first field, gs_base, is 40 bytes in size and represents the bottom of irq_stack. So the check with the BUILD_BUG_ON macro should pass. (You can read the first part about the Linux kernel initialization process if you're interested in the BUILD_BUG_ON macro.)
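The layout that BUILD_BUG_ON verifies can be reproduced in user space. In this sketch the union name, the helper functions and the value of IRQ_STACK_SIZE_SKETCH are assumptions made for illustration; only the field names and the 40-byte gs_base come from the union above. It relies on anonymous structures, which C11, GCC and Clang support.

```c
#include <assert.h>
#include <stddef.h>

#define IRQ_STACK_SIZE_SKETCH 16384	/* assumed size, for illustration only */

union irq_stack_union_sketch {
	char irq_stack[IRQ_STACK_SIZE_SKETCH];
	struct {
		char gs_base[40];
		unsigned long stack_canary;
	};
};

/* Offset of the canary from the start of the union. */
static size_t canary_offset(void)
{
	return offsetof(union irq_stack_union_sketch, stack_canary);
}

/* Both views start at the same address: gs_base overlays irq_stack[0..39]. */
static int views_overlap(void)
{
	union irq_stack_union_sketch u;

	return (void *)u.irq_stack == (void *)u.gs_base;
}
```

The canary lands at offset 40 because gs_base occupies the first 40 bytes and 40 already satisfies the alignment of unsigned long, so no padding is inserted between the two fields.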
After this we calculate a new canary value based on a random number and the Time Stamp Counter:
get_random_bytes(&canary, sizeof(canary));
tsc = __native_read_tsc();
canary += tsc + (tsc << 32UL);
and write the canary value to irq_stack_union with the this_cpu_write macro:
this_cpu_write(irq_stack_union.stack_canary, canary);
You can read more about this_cpu_* operations in the Linux kernel documentation.
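The mixing step shown above is plain unsigned 64-bit arithmetic, and it can be sketched outside the kernel. The function mix_canary is hypothetical; it takes the random value and the TSC reading as parameters instead of calling get_random_bytes and __native_read_tsc, so that the arithmetic can be checked with fixed inputs.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Same arithmetic as: canary += tsc + (tsc << 32UL);
 * Unsigned addition and shifting wrap modulo 2^64, so no overflow check is needed.
 */
static uint64_t mix_canary(uint64_t random_value, uint64_t tsc)
{
	return random_value + tsc + (tsc << 32);
}
```

For example, a TSC value of 1 added to a zero random value gives 0x100000001: the counter contributes to both the low and the high 32 bits of the canary.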
Disabling/Enabling local interrupts
The next step in init/main.c that is related to interrupts and interrupt handling, after we have set the canary value for the interrupt stack, is the call to the local_irq_disable macro.
This macro is defined in the include/linux/irqflags.h header file and, as you can guess, disables interrupts on the CPU. Let's look at its implementation. First of all, note that it depends on the CONFIG_TRACE_IRQFLAGS_SUPPORT kernel configuration option:
#ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
...
#define local_irq_disable() \
do { raw_local_irq_disable(); trace_hardirqs_off(); } while (0)
...
#else
...
#define local_irq_disable() do { raw_local_irq_disable(); } while (0)
...
#endif
Both variants are similar and, as you can see, differ in only one thing: the local_irq_disable macro contains a call to trace_hardirqs_off when CONFIG_TRACE_IRQFLAGS_SUPPORT is enabled. There is a special feature in the lockdep subsystem, irq-flags tracing, for tracing hardirq and softirq state. In our case the lockdep subsystem can give us interesting information about the hard/soft irq on/off events that occur in the system. The trace_hardirqs_off function is defined in kernel/locking/lockdep.c:
void trace_hardirqs_off(void)
{
trace_hardirqs_off_caller(CALLER_ADDR0);
}
EXPORT_SYMBOL(trace_hardirqs_off);
and just calls the trace_hardirqs_off_caller function. trace_hardirqs_off_caller checks the hardirqs_enabled field of the current process and increments redundant_hardirqs_off if the call to local_irq_disable was redundant, or hardirqs_off_events if it was not. These two fields and other lockdep statistics fields are defined in kernel/locking/lockdep_internals.h, in the lockdep_stats structure:
struct lockdep_stats {
...
...
...
int softirqs_off_events;
int redundant_softirqs_off;
...
...
...
};
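The counting rule described above can be modelled with a few lines of C. This is a simplified sketch, not the kernel implementation: task_sketch, stats_sketch, trace_off_sketch and run_two_disables are hypothetical names, and only the two counters and the hardirqs_enabled flag come from the text.

```c
#include <assert.h>
#include <stdbool.h>

struct task_sketch {
	bool hardirqs_enabled;	/* lockdep's view of the IRQ state */
};

struct stats_sketch {
	unsigned long hardirqs_off_events;
	unsigned long redundant_hardirqs_off;
};

/* Count a real on->off transition, or a redundant "off" when already off. */
static void trace_off_sketch(struct task_sketch *t, struct stats_sketch *s)
{
	if (t->hardirqs_enabled) {
		t->hardirqs_enabled = false;
		s->hardirqs_off_events++;
	} else {
		s->redundant_hardirqs_off++;
	}
}

/* Disable twice in a row: the second call changes nothing and is redundant. */
static struct stats_sketch run_two_disables(void)
{
	struct task_sketch t = { true };
	struct stats_sketch s = { 0, 0 };

	trace_off_sketch(&t, &s);
	trace_off_sketch(&t, &s);
	return s;
}
```

After two consecutive disables the model reports one off event and one redundant off, which is the distinction the statistics in the lockdep output draw.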
If you set the CONFIG_DEBUG_LOCKDEP kernel configuration option, the lockdep_stats_debug_show function will write all tracing information to /proc/lockdep:
static void lockdep_stats_debug_show(struct seq_file *m)
{
#ifdef CONFIG_DEBUG_LOCKDEP
unsigned long long hi1 = debug_atomic_read(hardirqs_on_events),
hi2 = debug_atomic_read(hardirqs_off_events),
hr1 = debug_atomic_read(redundant_hardirqs_on),
...
...
...
seq_printf(m, " hardirq on events: %11llu\n", hi1);
seq_printf(m, " hardirq off events: %11llu\n", hi2);
seq_printf(m, " redundant hardirq ons: %11llu\n", hr1);
#endif
}
and you can see its output with:
$ sudo cat /proc/lockdep
hardirq on events: 12838248974
hardirq off events: 12838248979
redundant hardirq ons: 67792
redundant hardirq offs: 3836339146
softirq on events: 38002159
softirq off events: 38002187
redundant softirq ons: 0
redundant softirq offs: 0
OK, now we know a little about tracing, but more information will come in a separate part about lockdep and tracing. You can see that both local_irq_disable macros share the same part, raw_local_irq_disable. This macro is defined in arch/x86/include/asm/irqflags.h and expands to a call to:
static inline void native_irq_disable(void)
{
asm volatile("cli": : :"memory");
}
You will remember that the cli instruction clears the IF flag, which determines whether the processor responds to maskable hardware interrupts. Besides local_irq_disable, as you may already know, there is an inverse macro, local_irq_enable. This macro has the same tracing mechanism and is very similar to local_irq_disable, but as you can understand from its name, it enables interrupts with the sti instruction:
static inline void native_irq_enable(void)
{
asm volatile("sti": : :"memory");
}
Now we know how local_irq_disable and local_irq_enable work. This was the first call of the local_irq_disable macro, but we will meet these macros many times in the Linux kernel source code. For now we are in the start_kernel function from init/main.c and we have just disabled local interrupts. Why local, and why did we do it? Previously the kernel provided a method to disable interrupts on all processors, and it was called cli. This function was removed, and now we have local_irq_{enable,disable} to enable or disable interrupts on the current processor. After we have disabled interrupts with the local_irq_disable macro, we set:
early_boot_irqs_disabled = true;
The early_boot_irqs_disabled variable is declared in include/linux/kernel.h:
extern bool early_boot_irqs_disabled;
and is used in different places. For example, it is used in the smp_call_function_many function from kernel/smp.c to check for a possible deadlock when interrupts are disabled:
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
&& !oops_in_progress && !early_boot_irqs_disabled);
Early trap initialization during kernel initialization
The next functions after local_irq_disable are boot_cpu_init and page_address_init, but they are not related to interrupts and exceptions (you can read more about these functions in the chapter about the Linux kernel initialization process). Next is the setup_arch function. As you may remember, this function is located in the arch/x86/kernel/setup.c source code file and initializes many different architecture-dependent things. The first interrupt-related function that we can see in setup_arch is the early_trap_init function. This function is defined in arch/x86/kernel/traps.c and fills the Interrupt Descriptor Table with a couple of entries:
void __init early_trap_init(void)
{
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
#ifdef CONFIG_X86_32
set_intr_gate(X86_TRAP_PF, page_fault);
#endif
load_idt(&idt_descr);
}
Here we can see calls of three different functions:
set_intr_gate_ist
set_system_intr_gate_ist
set_intr_gate
All of these functions are defined in arch/x86/include/asm/desc.h and do similar, but not identical, things. The first, set_intr_gate_ist, inserts a new interrupt gate into the IDT. Let's look at its implementation:
static inline void set_intr_gate_ist(int n, void *addr, unsigned ist)
{
BUG_ON((unsigned)n > 0xFF);
_set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS);
}
First of all we can see a check that n, the vector number of the interrupt, is not greater than 0xff or 255. We need this check because, as we remember from the previous part, the vector number of an interrupt must be between 0 and 255. In the next step we can see the call to the _set_gate function, which writes the given interrupt gate to the IDT:
static inline void _set_gate(int gate, unsigned type, void *addr,
unsigned dpl, unsigned ist, unsigned seg)
{
gate_desc s;
pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
write_idt_entry(idt_table, gate, &s);
write_trace_idt_entry(gate, &s);
}
Here we start with the pack_gate function, which takes a clean IDT entry represented by the gate_desc structure and fills it with the handler address, the segment selector, the Interrupt Stack Table index, the privilege level and the type of the gate, which can be one of the following values:
GATE_INTERRUPT
GATE_TRAP
GATE_CALL
GATE_TASK
and sets the present bit for the given IDT entry:
static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
unsigned dpl, unsigned ist, unsigned seg)
{
gate->offset_low = PTR_LOW(func);
gate->segment = __KERNEL_CS;
gate->ist = ist;
gate->p = 1;
gate->dpl = dpl;
gate->zero0 = 0;
gate->zero1 = 0;
gate->type = type;
gate->offset_middle = PTR_MIDDLE(func);
gate->offset_high = PTR_HIGH(func);
}
After this we write the newly filled interrupt gate to the IDT with the write_idt_entry macro, which expands to native_write_idt_entry and just copies the interrupt gate into idt_table at the given index:
#define write_idt_entry(dt, entry, g) native_write_idt_entry(dt, entry, g)
static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
{
memcpy(&idt[entry], gate, sizeof(*gate));
}
where idt_table is just an array of gate_desc:
extern gate_desc idt_table[];
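The two steps described above, splitting the handler address into the gate's three offset fields and copying the gate into the table, can be sketched in user space. The structure below is a deliberately reduced, hypothetical gate that keeps only the offset fields; the real gate_desc also carries the segment, IST, type, DPL and present bits. All names ending in _sketch, as well as pack_offset, unpack_offset, write_entry and roundtrip, are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reduced gate: only the three pieces of the 64-bit handler address. */
struct gate_sketch {
	uint16_t offset_low;	/* bits 0..15 */
	uint16_t offset_middle;	/* bits 16..31 */
	uint32_t offset_high;	/* bits 32..63 */
};

static struct gate_sketch table_sketch[256];

static void pack_offset(struct gate_sketch *g, uint64_t func)
{
	g->offset_low = (uint16_t)(func & 0xffff);
	g->offset_middle = (uint16_t)((func >> 16) & 0xffff);
	g->offset_high = (uint32_t)(func >> 32);
}

static uint64_t unpack_offset(const struct gate_sketch *g)
{
	return (uint64_t)g->offset_low |
	       ((uint64_t)g->offset_middle << 16) |
	       ((uint64_t)g->offset_high << 32);
}

/* Same idea as native_write_idt_entry(): copy the gate into slot 'entry'. */
static void write_entry(struct gate_sketch *table, int entry,
			const struct gate_sketch *g)
{
	memcpy(&table[entry], g, sizeof(*g));
}

/* Pack, store, and read the address back from the table. */
static uint64_t roundtrip(int vector, uint64_t func)
{
	struct gate_sketch g;

	assert((unsigned)vector <= 0xFF);
	pack_offset(&g, func);
	write_entry(table_sketch, vector, &g);
	return unpack_offset(&table_sketch[vector]);
}
```

For a made-up handler address 0xffffffff81234567 the fields become 0x4567, 0x8123 and 0xffffffff, and reading them back reproduces the original address.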
That's all. The second set_system_intr_gate_ist
function has only one difference from the set_intr_gate_ist
:
static inline void set_system_intr_gate_ist(int n, void *addr, unsigned ist)
{
BUG_ON((unsigned)n > 0xFF);
_set_gate(n, GATE_INTERRUPT, addr, 0x3, ist, __KERNEL_CS);
}
Do you see it? Look at the fourth parameter of _set_gate. It is 0x3. In set_intr_gate_ist it was 0x0. We know that this parameter represents the DPL, or descriptor privilege level. We also know that 0 is the highest privilege level and 3 is the lowest. Now we know how set_system_intr_gate_ist, set_intr_gate_ist and set_intr_gate work, and we can return to the early_trap_init function. Let's look at it again:
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
We set two IDT entries, for the #DB interrupt and for int3. These functions take the same set of parameters:
- vector number of an interrupt;
- address of an interrupt handler;
- interrupt stack table index.
That's all. You will learn more about interrupts and handlers in the next parts.
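Two of the checks discussed in this section can be modelled in plain C: the vector range test that the BUG_ON performs, and the privilege rule that explains why the int3 gate is installed with a DPL of 3. For software-generated interrupts such as int3, the processor requires the current privilege level (CPL) to be numerically less than or equal to the gate's DPL. The function names here are hypothetical and the code is only a sketch of those two rules.

```c
#include <assert.h>
#include <stdbool.h>

/* The cast to unsigned makes negative values fail the range check too. */
static bool vector_in_range(int n)
{
	return (unsigned)n <= 0xFF;
}

/*
 * Rule applied to software interrupts: the caller's CPL must be
 * numerically <= the gate's DPL, otherwise the instruction faults.
 */
static bool soft_int_allowed(unsigned cpl, unsigned gate_dpl)
{
	return cpl <= gate_dpl;
}
```

With a DPL of 3 a user-space program running at CPL 3 may execute int3, while the same instruction aimed at a gate with a DPL of 0 would be rejected. Hardware-raised exceptions such as #DB are not subject to this check, which is why that gate can keep a DPL of 0.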
Conclusion
This is the end of the second part about interrupts and interrupt handling in the Linux kernel. We saw some theory in the previous part and started to dive into interrupt and exception handling in this part. We started from the earliest parts of the Linux kernel source code that are related to interrupts. In the next part we will continue to dive into this interesting topic and learn more about the interrupt handling process.
If you have any questions or suggestions, write me a comment or ping me on twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.