Kernel initialization. Part 2.
Early interrupt and exception handling
In the previous part we stopped before setting up the early interrupt handlers. At this moment we are in the decompressed Linux kernel, we have a basic paging structure for early boot, and our current goal is to finish the early preparation before the main kernel code starts to work.
We already started this preparation in the first part of this chapter. We continue in this part and will learn more about interrupt and exception handling.
Remember that we stopped before the following loop:
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
set_intr_gate(i, early_idt_handler_array[i]);
from the arch/x86/kernel/head64.c source code file. But before we sort out this code, we need to know about interrupts and handlers.
Some theory
An interrupt is an event raised by software or hardware that requires the CPU's attention. For example, a user has pressed a key on the keyboard. On an interrupt, the CPU stops the current task and transfers control to a special routine called an interrupt handler. The interrupt handler handles the interrupt and transfers control back to the previously stopped task. We can split interrupts into three types:
- Software interrupts - when software signals the CPU that it needs kernel attention. These interrupts are generally used for system calls;
- Hardware interrupts - when a hardware event happens, for example a key is pressed on a keyboard;
- Exceptions - interrupts generated by the CPU when it detects an error, for example division by zero or accessing a memory page which is not in RAM.
Every interrupt and exception is assigned a unique number called the vector number. A vector number can be any number from 0 to 255. It is common practice to use the first 32 vector numbers for exceptions, while vector numbers from 32 to 255 are used for user-defined interrupts. We can see this in the code above in NUM_EXCEPTION_VECTORS, which is defined as:
#define NUM_EXCEPTION_VECTORS 32
The CPU uses the vector number as an index into the Interrupt Descriptor Table (we will see its description soon). The CPU receives interrupts from the APIC or through its pins. The following table shows exceptions 0-31:
----------------------------------------------------------------------------------------------
|Vector|Mnemonic|Description         |Type |Error Code|Source                                |
----------------------------------------------------------------------------------------------
|0     | #DE    |Divide Error        |Fault|NO        |DIV and IDIV                          |
|1     | #DB    |Debug Exception     |F/T  |NO        |Breakpoints, single-step              |
|2     | ---    |NMI                 |INT  |NO        |external NMI                          |
|3     | #BP    |Breakpoint          |Trap |NO        |INT 3                                 |
|4     | #OF    |Overflow            |Trap |NO        |INTO instruction                      |
|5     | #BR    |Bound Range Exceeded|Fault|NO        |BOUND instruction                     |
|6     | #UD    |Invalid Opcode      |Fault|NO        |UD2 instruction                       |
|7     | #NM    |Device Not Available|Fault|NO        |Floating point or [F]WAIT             |
|8     | #DF    |Double Fault        |Abort|YES       |An instruction which can generate NMI |
|9     | ---    |Reserved            |Fault|NO        |                                      |
|10    | #TS    |Invalid TSS         |Fault|YES       |Task switch or TSS access             |
|11    | #NP    |Segment Not Present |Fault|YES       |Accessing segment register            |
|12    | #SS    |Stack-Segment Fault |Fault|YES       |Stack operations                      |
|13    | #GP    |General Protection  |Fault|YES       |Memory reference                      |
|14    | #PF    |Page fault          |Fault|YES       |Memory reference                      |
|15    | ---    |Reserved            |     |NO        |                                      |
|16    | #MF    |x87 FPU fp error    |Fault|NO        |Floating point or [F]WAIT             |
|17    | #AC    |Alignment Check     |Fault|YES       |Data reference                        |
|18    | #MC    |Machine Check       |Abort|NO        |                                      |
|19    | #XM    |SIMD fp exception   |Fault|NO        |SSE[2,3] instructions                 |
|20    | #VE    |Virtualization exc. |Fault|NO        |EPT violations                        |
|21-31 | ---    |Reserved            |INT  |NO        |External interrupts                   |
----------------------------------------------------------------------------------------------
To react to an interrupt, the CPU uses a special structure - the Interrupt Descriptor Table or IDT. In protected mode the IDT is an array of 8-byte descriptors, like the Global Descriptor Table, but IDT entries are called gates. The CPU multiplies the vector number by 8 to find the offset of the IDT entry. In 64-bit mode the IDT is an array of 16-byte descriptors, so the CPU multiplies the vector number by 16 instead. We remember from the previous part that the CPU uses the special GDTR register to locate the Global Descriptor Table; in the same way it uses the special IDTR register for the Interrupt Descriptor Table, and the lidt instruction loads the base address and limit of the table into this register.
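The offset calculation described above can be sketched in a few lines of C. This is only an illustration: the helper name idt_gate_offset and the GATE_SIZE_* constants are made up for this example and do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Gate sizes from the text: 8 bytes in protected mode, 16 in 64-bit mode. */
enum { GATE_SIZE_32 = 8, GATE_SIZE_64 = 16 };

/*
 * Hypothetical helper: byte offset of the gate for `vector`,
 * counted from the base address stored in IDTR.
 */
static size_t idt_gate_offset(unsigned int vector, int long_mode)
{
    assert(vector <= 255);  /* only 256 vectors exist */
    return (size_t)vector * (long_mode ? GATE_SIZE_64 : GATE_SIZE_32);
}
```

For example, idt_gate_offset(14, 1) gives the offset of the page fault gate in a 64-bit IDT.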
64-bit mode IDT entry has following structure:
127                                                                            96
--------------------------------------------------------------------------------
|                                                                              |
|                                   Reserved                                   |
|                                                                              |
--------------------------------------------------------------------------------
95                                                                             64
--------------------------------------------------------------------------------
|                                                                              |
|                                Offset 63..32                                 |
|                                                                              |
--------------------------------------------------------------------------------
63                              48 47    46  44   42    39             34     32
--------------------------------------------------------------------------------
|                                 |     | D |   |     |       |   |   |        |
|         Offset 31..16           |  P  | P | 0 |Type | 0 0 0 | 0 | 0 |  IST   |
|                                 |     | L |   |     |       |   |   |        |
--------------------------------------------------------------------------------
31                                    16 15                                    0
--------------------------------------------------------------------------------
|                                       |                                      |
|           Segment Selector            |             Offset 15..0             |
|                                       |                                      |
--------------------------------------------------------------------------------
Where:
- Offset - is the offset to the entry point of an interrupt handler;
- DPL - Descriptor Privilege Level;
- P - Segment Present flag;
- Segment selector - a code segment selector in the GDT or LDT;
- IST - provides the ability to switch to a new stack for interrupt handling.
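To make the layout more concrete, here is a minimal C sketch of a 16-byte gate. The struct and both helpers are hypothetical (the kernel uses its own gate_desc type); the field positions follow the diagram above, with P in bit 47, DPL in bits 45..46, Type in bits 40..43 and IST in bits 32..34.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 64-bit gate layout; the field names are hypothetical. */
struct gate64 {
    uint16_t offset_low;     /* Offset 15..0 */
    uint16_t selector;       /* Segment Selector */
    uint8_t  ist;            /* bits 32..34: IST index, upper bits zero */
    uint8_t  type_attr;      /* Type (40..43), DPL (45..46), P (47) */
    uint16_t offset_middle;  /* Offset 31..16 */
    uint32_t offset_high;    /* Offset 63..32 */
    uint32_t reserved;       /* must be zero */
};

_Static_assert(sizeof(struct gate64) == 16, "a 64-bit gate is 16 bytes");

/* Build a present gate by splitting the handler address into three parts. */
static struct gate64 gate64_make(uint64_t handler, uint16_t selector,
                                 uint8_t type, uint8_t dpl, uint8_t ist)
{
    struct gate64 g;

    assert(type <= 0xF && dpl <= 3 && ist <= 7);
    g.offset_low    = (uint16_t)handler;
    g.selector      = selector;
    g.ist           = ist;
    g.type_attr     = (uint8_t)(0x80 | (dpl << 5) | type);  /* P = 1 */
    g.offset_middle = (uint16_t)(handler >> 16);
    g.offset_high   = (uint32_t)(handler >> 32);
    g.reserved      = 0;
    return g;
}

/* Reassemble the 64-bit handler address from the three offset fields. */
static uint64_t gate64_offset(const struct gate64 *g)
{
    return (uint64_t)g->offset_low |
           ((uint64_t)g->offset_middle << 16) |
           ((uint64_t)g->offset_high << 32);
}
```

Calling gate64_make with type 0xE and DPL 0 produces an attribute byte of 0x8E, the usual encoding of a present interrupt gate.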
And the last Type field describes the type of the IDT entry. There are three different kinds of descriptors for interrupts:
- Task descriptor
- Interrupt descriptor
- Trap descriptor
Interrupt and trap descriptors contain a far pointer to the entry point of the interrupt handler. The only difference between these types is how the CPU handles the IF flag. If the interrupt handler was accessed through an interrupt gate, the CPU clears the IF flag to prevent other maskable interrupts while the current interrupt handler executes. When the handler finishes, the iret instruction restores the saved flags register, and with it the previous value of the IF flag. A trap gate leaves the IF flag unchanged.
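The difference between the two gate types can be modelled with a tiny C sketch. It only models the IF bit (bit 9 of the flags register) and ignores everything else the CPU does on entry; the function names are invented for this illustration, and 0xE and 0xF are the 64-bit interrupt and trap gate type values.

```c
#include <assert.h>
#include <stdint.h>

#define X86_EFLAGS_IF   (UINT64_C(1) << 9)  /* interrupt enable flag */
#define GATE_INTERRUPT  0xE                 /* 64-bit interrupt gate type */
#define GATE_TRAP       0xF                 /* 64-bit trap gate type */

/* Flags value seen by the handler, given the flags of the interrupted code. */
static uint64_t rflags_on_entry(uint64_t rflags, unsigned int gate_type)
{
    assert(gate_type == GATE_INTERRUPT || gate_type == GATE_TRAP);
    if (gate_type == GATE_INTERRUPT)
        rflags &= ~X86_EFLAGS_IF;   /* mask further maskable interrupts */
    return rflags;                  /* a trap gate leaves IF untouched */
}

/* iret pops the saved flags, so IF returns to its pre-interrupt value. */
static uint64_t rflags_after_iret(uint64_t saved_rflags)
{
    return saved_rflags;
}
```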
Other bits in the interrupt gate are reserved and must be 0. Now let's look at how the CPU handles interrupts:
- The CPU saves the flags register, CS, and the instruction pointer on the stack;
- If the interrupt provides an error code (like #PF for example), the CPU saves the error code on the stack after the instruction pointer;
- After the interrupt handler has executed, the iret instruction is used to return from it.
Now let's get back to the code.
Fill and load IDT
We stopped at the following point:
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
set_intr_gate(i, early_idt_handler_array[i]);
Here we call set_intr_gate in the loop, which takes two parameters:
- Number of an interrupt or vector number;
- Address of the idt handler.
and inserts an interrupt gate into the IDT, which is represented by the idt_table array. First of all let's look at the early_idt_handler_array array. It is declared in the arch/x86/include/asm/segment.h header file and contains the entry points of the first 32 exception handlers:
#define EARLY_IDT_HANDLER_SIZE 9
#define NUM_EXCEPTION_VECTORS 32
extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE];
The early_idt_handler_array is a 288-byte array which contains an exception entry point every nine bytes. Each nine-byte entry consists of an optional two-byte instruction that pushes a dummy error code if the exception does not provide one, a two-byte instruction that pushes the vector number onto the stack, and a five-byte jump to the common exception handler code.
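Because every entry has the same size, the address of the stub for a given vector is simple arithmetic. The sketch below shows it; the helper name early_stub_address is hypothetical and the base address used with it is just an example value.

```c
#include <assert.h>
#include <stdint.h>

#define EARLY_IDT_HANDLER_SIZE 9
#define NUM_EXCEPTION_VECTORS  32

/* Total size of the array: 32 entries of 9 bytes each. */
enum { EARLY_IDT_ARRAY_BYTES = NUM_EXCEPTION_VECTORS * EARLY_IDT_HANDLER_SIZE };

/* Hypothetical helper: address of the entry stub for `vector`. */
static uint64_t early_stub_address(uint64_t array_base, unsigned int vector)
{
    assert(vector < NUM_EXCEPTION_VECTORS);
    return array_base + (uint64_t)vector * EARLY_IDT_HANDLER_SIZE;
}
```

This is the same address that the expression early_idt_handler_array[i] yields in the loop, since each row of the two-dimensional char array is nine bytes long.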
As we can see, we fill only the first 32 IDT entries in the loop, because all of the early setup runs with interrupts disabled, so there is no need to set up interrupt handlers for vectors 32 and above. The early_idt_handler_array array contains generic idt handlers and we can find its definition in the arch/x86/kernel/head_64.S assembly file. For now we will skip it, but we will look at it soon. Before this we will look at the implementation of the set_intr_gate function.
The set_intr_gate function is defined in the arch/x86/kernel/idt.c source file and looks like this:
static void set_intr_gate(unsigned int n, const void *addr)
{
struct idt_data data;
BUG_ON(n > 0xFF);
memset(&data, 0, sizeof(data));
data.vector = n;
data.addr = addr;
data.segment = __KERNEL_CS;
data.bits.type = GATE_INTERRUPT;
data.bits.p = 1;
idt_setup_from_table(idt_table, &data, 1, false);
}
First of all it checks with the BUG_ON macro that the passed interrupt number is not greater than 255. We need this check because there can be only 256 interrupt vectors. After this, we set up the idt data with the given values. Then we call the idt_setup_from_table function, which looks like this:
static void
idt_setup_from_table(gate_desc *idt, const struct idt_data *t, int size, bool sys)
{
gate_desc desc;
for (; size > 0; t++, size--) {
desc.offset_low = (u16) t->addr;
desc.segment = (u16) t->segment;
desc.bits = t->bits;
desc.offset_middle = (u16) (t->addr >> 16);
desc.offset_high = (u32) (t->addr >> 32);
desc.reserved = 0;
memcpy(&idt[t->vector], &desc, sizeof(desc));
if (sys)
set_bit(t->vector, system_vectors);
}
}
It fills the three parts of the handler address in the gate descriptor with the address which we got in the main loop (the address of the interrupt handler entry point), and then copies the gate descriptor to the idt entry.
After the main loop has finished, we have a filled idt_table array of gate_desc structures and we can load the Interrupt Descriptor Table with the call:
load_idt((const struct desc_ptr *)&idt_descr);
Where idt_descr is:
struct desc_ptr idt_descr __ro_after_init = {
.size = (IDT_ENTRIES * 2 * sizeof(unsigned long)) - 1,
.address = (unsigned long) idt_table,
};
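The .size field is the table limit, that is, the size of the table in bytes minus one. A small sketch of the arithmetic follows. It assumes a 64-bit target (so sizeof(unsigned long) is replaced by a fixed 8-byte type) and IDT_ENTRIES equal to 256; the struct and helper names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define IDT_ENTRIES 256  /* assumption: one gate per possible vector */

/* Stand-in for the 10-byte operand of lidt: 16-bit limit, 64-bit base. */
struct desc_ptr_sketch {
    uint16_t size;
    uint64_t address;
} __attribute__((packed));

/* Each gate is 2 * 8 = 16 bytes, so the limit is 256 * 16 - 1. */
static uint16_t idt_limit(void)
{
    return (uint16_t)(IDT_ENTRIES * 2 * sizeof(uint64_t) - 1);
}

static struct desc_ptr_sketch make_idt_descr(uint64_t table_address)
{
    struct desc_ptr_sketch d = { idt_limit(), table_address };
    return d;
}
```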
and load_idt just executes the lidt instruction:
asm volatile("lidt %0"::"m" (*dtr));
You may note that there are calls to the _trace_* functions in _set_gate and other functions. These functions fill IDT gates in the same manner as _set_gate, with one difference: they use the trace_idt_table Interrupt Descriptor Table instead of idt_table for tracepoints (we will cover this topic in another part).
Okay, now we have filled and loaded the Interrupt Descriptor Table, and we know how the CPU acts during an interrupt. So now it is time to deal with interrupt handlers.
Early interrupts handlers
As you can read above, we filled the IDT with the addresses of the entries of early_idt_handler_array. We can find it in the arch/x86/kernel/head_64.S assembly file:
ENTRY(early_idt_handler_array)
i = 0
.rept NUM_EXCEPTION_VECTORS
.if ((EXCEPTION_ERRCODE_MASK >> i) & 1) == 0
UNWIND_HINT_IRET_REGS
pushq $0 # Dummy error code, to make stack frame uniform
.else
UNWIND_HINT_IRET_REGS offset=8
.endif
pushq $i # 72(%rsp) Vector number
jmp early_idt_handler_common
UNWIND_HINT_IRET_REGS
i = i + 1
.fill early_idt_handler_array + i*EARLY_IDT_HANDLER_SIZE - ., 1, 0xcc
.endr
UNWIND_HINT_IRET_REGS offset=16
END(early_idt_handler_array)
Functions which tend not to be called directly by other functions, such as syscall and interrupt handlers, often do unusual non-C-function-type things with the stack pointer. Such code needs to be annotated with the UNWIND_HINT_IRET_REGS macro so that objtool can understand it. Here we can see the generation of interrupt handlers for the first 32 exceptions. If an exception provides an error code we do nothing; if it does not, we push zero onto the stack so that the stack layout is uniform. After that we push the vector number onto the stack and jump to early_idt_handler_common, which is the generic interrupt handler for now. As we saw above, every nine bytes of the early_idt_handler_array array consist of an optional push of a dummy error code, a push of the vector number and a jump instruction. We can see it in the output of the objdump util:
$ objdump -D vmlinux
...
...
...
ffffffff81fe5000 <early_idt_handler_array>:
ffffffff81fe5000: 6a 00 pushq $0x0
ffffffff81fe5002: 6a 00 pushq $0x0
ffffffff81fe5004: e9 17 01 00 00 jmpq ffffffff81fe5120 <early_idt_handler_common>
ffffffff81fe5009: 6a 00 pushq $0x0
ffffffff81fe500b: 6a 01 pushq $0x1
ffffffff81fe500d: e9 0e 01 00 00 jmpq ffffffff81fe5120 <early_idt_handler_common>
ffffffff81fe5012: 6a 00 pushq $0x0
ffffffff81fe5014: 6a 02 pushq $0x2
...
...
...
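The .if test in the macro checks one bit of EXCEPTION_ERRCODE_MASK per vector. A rough C equivalent is shown below. The mask value 0x00027d00 is an assumption: it is the value as I remember it from the kernel headers (bits 8, 10-14 and 17), so please verify it against your source tree. The helper names are made up.

```c
#include <assert.h>

#define NUM_EXCEPTION_VECTORS  32
/* Assumed value: bits 8, 10, 11, 12, 13, 14 and 17 are set. */
#define EXCEPTION_ERRCODE_MASK 0x00027d00u

/* Non-zero if the CPU itself pushes an error code for this vector. */
static int vector_has_error_code(unsigned int vector)
{
    assert(vector < NUM_EXCEPTION_VECTORS);
    return (EXCEPTION_ERRCODE_MASK >> vector) & 1;
}

/* Number of vectors whose stub skips the dummy `pushq $0`. */
static int count_error_code_vectors(void)
{
    int n = 0;

    for (unsigned int v = 0; v < NUM_EXCEPTION_VECTORS; v++)
        n += vector_has_error_code(v);
    return n;
}
```

For these vectors the stub is two bytes shorter, and the .fill directive pads it with 0xcc bytes up to EARLY_IDT_HANDLER_SIZE.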
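As I wrote above, the CPU pushes the flags register, CS and RIP on the stack (in 64-bit mode it pushes SS and RSP before them). So before the early_idt_handler entry for an exception with an error code is executed, the stack will contain the following data:
|--------------------|
| %ss                |
| %rsp               |
| %rflags            |
| %cs                |
| %rip               |
| error code         | <-- %rsp
|--------------------|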
Now let's look at the early_idt_handler_common implementation. It is located in the same arch/x86/kernel/head_64.S assembly file. First of all we increment early_recursion_flag to detect recursion in early_idt_handler_common:
incl early_recursion_flag(%rip)
Next we save general registers on the stack:
pushq %rsi
movq 8(%rsp), %rsi
movq %rdi, 8(%rsp)
pushq %rdx
pushq %rcx
pushq %rax
pushq %r8
pushq %r9
pushq %r10
pushq %r11
pushq %rbx
pushq %rbp
pushq %r12
pushq %r13
pushq %r14
pushq %r15
UNWIND_HINT_REGS
We need to do this so that the registers hold their original values when we return from the interrupt handler. After this we check the vector number, and if it is #PF or Page Fault, we put the value from the cr2 register into the rdi register and call early_make_pgtable (we'll see it soon):
cmpq $14,%rsi
jnz 10f
GET_CR2_INTO(%rdi)
call early_make_pgtable
andl %eax,%eax
jz 20f
If the vector number is not #PF, or if early_make_pgtable returned a non-zero value, we call the early_fixup_exception function and pass it the kernel stack pointer as the first argument (refer to the x86-64 calling convention):
10:
movq %rsp,%rdi
call early_fixup_exception
We'll see the implementation of the early_fixup_exception function later.
20:
decl early_recursion_flag(%rip)
jmp restore_regs_and_return_to_kernel
After we decrement the early_recursion_flag, we restore the registers which we saved earlier from the stack and return from the handler with iretq.
This is the end of the first interrupt handler. Note that it is a very early interrupt handler, so for now it handles only the Page Fault itself. We will see handlers for the other interrupts, but now let's look at the page fault handler.
Page fault handling
In the previous paragraph we saw the first early interrupt handler, which checks whether the vector number is a page fault and, if so, calls early_make_pgtable to build new page tables. We need a #PF handler at this step because there are plans to add the ability to load the kernel above 4G and to access the boot_params structure above 4G.
You can find the implementation of early_make_pgtable in arch/x86/kernel/head64.c. It takes one parameter - the address from the cr2 register which caused the Page Fault. Let's look at it:
int __init early_make_pgtable(unsigned long address)
{
unsigned long physaddr = address - __PAGE_OFFSET;
pmdval_t pmd;
pmd = (physaddr & PMD_MASK) + early_pmd_flags;
return __early_make_pgtable(address, pmd);
}
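The calculation of the pmd value can be tried out in isolation. In this sketch the page offset and the flags are passed in as parameters, because their real values depend on the kernel version and configuration; the values used in the usage note are assumptions for illustration only, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SHIFT 21                                   /* a pmd entry maps 2 MiB */
#define PMD_MASK  (~((UINT64_C(1) << PMD_SHIFT) - 1))

/*
 * Hypothetical stand-in for the first lines of early_make_pgtable():
 * translate the faulting virtual address to a physical one, round it
 * down to a 2 MiB boundary and add the entry flags.
 */
static uint64_t early_pmd_value(uint64_t address, uint64_t page_offset,
                                uint64_t flags)
{
    uint64_t physaddr;

    assert(address >= page_offset);
    physaddr = address - page_offset;
    return (physaddr & PMD_MASK) + flags;
}
```

With an assumed page offset of 0xffff888000000000 and assumed flags of 0xE3, the address 0xffff888012345678 gives the entry value 0x122000E3.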
It calculates the page middle directory entry for the faulting address and then calls the __early_make_pgtable function, which is defined in the same file as early_make_pgtable as follows:
int __init __early_make_pgtable(unsigned long address, pmdval_t pmd)
{
unsigned long physaddr = address - __PAGE_OFFSET;
pgdval_t pgd, *pgd_p;
p4dval_t p4d, *p4d_p;
pudval_t pud, *pud_p;
pmdval_t *pmd_p;
...
...
...
}
It starts with the definition of some variables which have *val_t types. All of these types are just:
typedef unsigned long pgdval_t;
We will also operate with the *_t (not val) types, for example pgd_t and so on. All of these types are defined in arch/x86/include/asm/pgtable_types.h and represent structures like this:
typedef struct { pgdval_t pgd; } pgd_t;
For example,
extern pgd_t early_top_pgt[PTRS_PER_PGD];
Here early_top_pgt represents the early top-level page table directory, which is an array of pgd_t entries, and each pgd entry points to lower-level page tables.
After we have checked that the address is valid, we get the address of the Page Global Directory entry which covers the #PF address and put its value into the pgd variable:
pgd_p = &early_top_pgt[pgd_index(address)].pgd;
pgd = *pgd_p;
Next we check if five-level paging is enabled:
if (!pgtable_l5_enabled())
p4d_p = pgd_p;
In most cases five-level paging is not enabled, so p4d_p most likely equals pgd_p.
After this we fix up address of the p4d with:
p4d_p += p4d_index(address);
p4d = *p4d_p;
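The *_index helpers used here all do the same thing: they cut a 9-bit field out of the virtual address. The following sketch assumes 4-level paging with 512 entries per table and shifts of 39, 30 and 21 bits for the pgd, pud and pmd levels; the function table_index is a made-up stand-in for pgd_index, pud_index and pmd_index.

```c
#include <assert.h>
#include <stdint.h>

#define PTRS_PER_TABLE 512u  /* entries in every table level */
#define PMD_SHIFT      21
#define PUD_SHIFT      30
#define PGDIR_SHIFT    39    /* assumption: 4-level paging */

/* Hypothetical generic version of pgd_index/pud_index/pmd_index. */
static unsigned int table_index(uint64_t address, unsigned int shift)
{
    assert(shift == PMD_SHIFT || shift == PUD_SHIFT || shift == PGDIR_SHIFT);
    return (unsigned int)((address >> shift) & (PTRS_PER_TABLE - 1));
}
```

For the address 0xffffffff81000000 this gives index 511 in the page global directory, 510 in the page upper directory and 8 in the page middle directory.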
In the next step we check p4d. If it contains a valid p4d entry, we take the physical address stored in it, convert it to a virtual address and put it into pud_p with:
pud_p = (pudval_t *)((p4d & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
where PTE_PFN_MASK is a macro:
#define PTE_PFN_MASK ((pteval_t)PHYSICAL_PAGE_MASK)
which expands to:
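(signed long)(~(PAGE_SIZE-1)) & ((1ULL << 52) - 1)
Here sign-extension is used for the first operand, so all bits above the page offset are set. The second operand is 52 bits of ones:
0b1111111111111111111111111111111111111111111111111111
The result keeps bits 12..51 of an entry, which hold the physical address of the page frame, and clears the flag bits below and above them.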
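Evaluating this expression by hand is error-prone, so here is a small sketch that computes it. It assumes 4 KiB pages and a 52-bit physical address mask, as in the expression above; the function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT          12
#define PAGE_SIZE           (UINT64_C(1) << PAGE_SHIFT)
#define PHYSICAL_MASK_SHIFT 52  /* assumption taken from the expression above */

/* Bits 12..51: the physical frame address stored in a table entry. */
static uint64_t pte_pfn_mask(void)
{
    uint64_t page_mask = ~(PAGE_SIZE - 1);                           /* clears bits 0..11 */
    uint64_t phys_mask = (UINT64_C(1) << PHYSICAL_MASK_SHIFT) - 1;   /* keeps bits 0..51 */

    return page_mask & phys_mask;
}

/* Strip the flag bits from an entry, leaving the physical address. */
static uint64_t entry_to_phys(uint64_t entry)
{
    return entry & pte_pfn_mask();
}
```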
If p4d does not contain a valid entry, we check whether next_early_pgt has reached EARLY_DYNAMIC_PAGE_TABLES, which is 64 and represents a fixed number of buffers used to set up new page tables on demand. If next_early_pgt is greater than or equal to EARLY_DYNAMIC_PAGE_TABLES, we reset the page tables and start again. If next_early_pgt is less than EARLY_DYNAMIC_PAGE_TABLES, we create a new page upper directory pointer which points to the current dynamic page table, clear the table, and write its physical address with the _KERNPG_TABLE access rights to the p4d entry:
if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
reset_early_page_tables();
goto again;
}
pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
for (i = 0; i < PTRS_PER_PUD; i++)
pud_p[i] = 0;
*p4d_p = (p4dval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
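The allocation pattern above is a simple bump allocator over a fixed pool with a reset when the pool runs out. Here is a simplified user-space sketch of it. The function alloc_early_table and the pool_resets counter are invented for this example, and the reset is reduced to rewinding the index, while the real reset_early_page_tables does more work.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EARLY_DYNAMIC_PAGE_TABLES 64
#define PTRS_PER_TABLE            512

static uint64_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_TABLE];
static unsigned int next_early_pgt;
static unsigned int pool_resets;  /* hypothetical counter, for illustration */

/* Hand out the next zeroed table; start over when the pool is exhausted. */
static uint64_t *alloc_early_table(void)
{
    uint64_t *table;

    assert(next_early_pgt <= EARLY_DYNAMIC_PAGE_TABLES);
    if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
        /* Simplified stand-in for reset_early_page_tables(). */
        next_early_pgt = 0;
        pool_resets++;
    }
    table = early_dynamic_pgts[next_early_pgt++];
    memset(table, 0, sizeof(early_dynamic_pgts[0]));
    return table;
}
```

After a reset, previously handed-out tables are reused, which is why the kernel code jumps back to the again label and rebuilds the mapping from the top.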
As we did above, we fix up address of the page upper directory with:
pud_p += pud_index(address);
pud = *pud_p;
In the next step we do the same as before, but with the page middle directory. In the end we write the entry of the page middle directory which maps the faulting virtual address (for example, kernel text and data):
pmd_p[pmd_index(address)] = pmd;
After the page fault handler has finished its work, our early_top_pgt contains entries which point to valid addresses.
Other exception handling
In the early interrupt phase, exceptions other than page fault are handled by the early_fixup_exception function, which is defined in arch/x86/mm/extable.c and takes two parameters - a pointer to the kernel stack, which consists of the saved registers, and the interrupt number:
void __init early_fixup_exception(struct pt_regs *regs, int trapnr)
{
...
...
...
}
First of all we need to pass some condition checks:
if (trapnr == X86_TRAP_NMI)
return;
if (early_recursion_flag > 2)
goto halt_loop;
if (!xen_pv_domain() && regs->cs != __KERNEL_CS)
goto fail;
Here we just ignore NMI, make sure that we are not in a recursive situation, and check that the exception came from kernel code. After that, we get to:
if (fixup_exception(regs, trapnr))
return;
The fixup_exception function is defined in the same file as the early_fixup_exception function and looks like this:
int fixup_exception(struct pt_regs *regs, int trapnr)
{
const struct exception_table_entry *e;
ex_handler_t handler;
e = search_exception_tables(regs->ip);
if (!e)
return 0;
handler = ex_fixup_handler(e);
return handler(e, regs, trapnr);
}
The ex_handler_t is a function pointer type, which is defined like this:
typedef bool (*ex_handler_t)(const struct exception_table_entry *,
struct pt_regs *, int);
The search_exception_tables function looks up the given address in the exception table (i.e. the contents of the ELF section __ex_table). After that, we get the address of the actual handler with the ex_fixup_handler function. Finally we call the actual handler. For more information about the exception table, you can refer to Documentation/x86/exception-tables.txt.
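The idea behind the lookup can be shown with a simplified sketch: a table sorted by the address of the faulting instruction, searched with a binary search. Note that this is not the kernel's layout. The struct ex_entry with absolute addresses and the functions search_table and fixup_ip are hypothetical, and the fixup here only redirects the instruction pointer instead of calling a handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified entry: where a fault may happen and where to continue. */
struct ex_entry {
    uint64_t insn;   /* address of the instruction that may fault */
    uint64_t fixup;  /* address to resume at if it does */
};

/* bsearch comparator: key is the faulting ip, elt is a table entry. */
static int cmp_ex(const void *key, const void *elt)
{
    uint64_t ip = *(const uint64_t *)key;
    const struct ex_entry *e = elt;

    if (ip < e->insn)
        return -1;
    return ip > e->insn;
}

/* Find the entry for `ip` in a table sorted by insn, or NULL. */
static const struct ex_entry *search_table(const struct ex_entry *tbl,
                                           size_t n, uint64_t ip)
{
    assert(tbl != NULL);
    return bsearch(&ip, tbl, n, sizeof(*tbl), cmp_ex);
}

/* Returns 1 and redirects *ip if a fixup exists, 0 otherwise. */
static int fixup_ip(const struct ex_entry *tbl, size_t n, uint64_t *ip)
{
    const struct ex_entry *e = search_table(tbl, n, *ip);

    if (!e)
        return 0;
    *ip = e->fixup;
    return 1;
}
```

The return convention mirrors fixup_exception: a non-zero result means the exception was handled and execution can continue at the new address.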
Back in the early_fixup_exception function, the next step is:
if (fixup_bug(regs, trapnr))
return;
The fixup_bug function is defined in arch/x86/kernel/traps.c. Let's have a look at its implementation:
int fixup_bug(struct pt_regs *regs, int trapnr)
{
if (trapnr != X86_TRAP_UD)
return 0;
switch (report_bug(regs->ip, regs)) {
case BUG_TRAP_TYPE_NONE:
case BUG_TRAP_TYPE_BUG:
break;
case BUG_TRAP_TYPE_WARN:
regs->ip += LEN_UD2;
return 1;
}
return 0;
}
All this function does is return 1 if the exception was generated because #UD (or Invalid Opcode) occurred and the report_bug function returned BUG_TRAP_TYPE_WARN. In that case it also advances the instruction pointer past the UD2 instruction. Otherwise it returns 0.
Conclusion
This is the end of the second part about linux kernel insides. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue. In the next part we will see all the steps before the kernel entry point - the start_kernel function.
Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to linux-insides.