mirror of
https://github.com/0xAX/linux-insides.git
synced 2025-05-12 09:18:47 +00:00
Many fixes for initialization and MM related parts
This commit is contained in:
parent
8360182c53
commit
a1e3af3c55
@ -20,6 +20,7 @@ First steps in the kernel
|
||||
Okay, we got the address of the decompressed kernel image from the `decompress_kernel` function into `rax` register and just jumped there. As we already know the entry point of the decompressed kernel image starts in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) assembly source code file and at the beginning of it, we can see following definitions:
|
||||
|
||||
```assembly
|
||||
.text
|
||||
__HEAD
|
||||
.code64
|
||||
.globl startup_64
|
||||
@ -91,7 +92,11 @@ Here we just compare low part of the `rbp` register with the complemented value
|
||||
|
||||
```C
|
||||
#define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1))
|
||||
```
|
||||
|
||||
where `PMD_PAGE_SIZE` macro defined as:
|
||||
|
||||
```
|
||||
#define PMD_PAGE_SIZE (_AC(1, UL) << PMD_SHIFT)
|
||||
#define PMD_SHIFT 21
|
||||
```
|
||||
@ -207,10 +212,6 @@ After this we store address of the `_text` in the `rax` and get the index of the
|
||||
```assembly
|
||||
movq %rdi, %rax
|
||||
shrq $PGDIR_SHIFT, %rax
|
||||
|
||||
leaq (4096 + _KERNPG_TABLE)(%rbx), %rdx
|
||||
movq %rdx, 0(%rbx,%rax,8)
|
||||
movq %rdx, 8(%rbx,%rax,8)
|
||||
```
|
||||
|
||||
where `PGDIR_SHIFT` is `39`. `PGDIR_SHFT` indicates the mask for page global directory bits in a virtual address. There are macro for all types of page directories:
|
||||
@ -221,45 +222,50 @@ where `PGDIR_SHIFT` is `39`. `PGDIR_SHFT` indicates the mask for page global dir
|
||||
#define PMD_SHIFT 21
|
||||
```
|
||||
|
||||
After this we put the address of the first `level3_kernel_pgt` in the `rdx` with the `_KERNPG_TABLE` access rights (see above) and fill the `early_level4_pgt` with the 2 `level3_kernel_pgt` entries.
|
||||
After this we put the address of the first entry of the `early_dynamic_pgts` page table to the `rdx` register with the `_KERNPG_TABLE` access rights (see above) and fill the `early_level4_pgt` with the 2 `early_dynamic_pgts` entries:
|
||||
|
||||
After this we add `4096` (size of the `early_level4_pgt`) to the `rdx` (it now contains the address of the first entry of the `level3_kernel_pgt`) and put `rdi` (it now contains physical address of the `_text`) to the `rax`. And after this we write addresses of the two page upper directory entries to the `level3_kernel_pgt`:
|
||||
```assembly
|
||||
leaq (4096 + _KERNPG_TABLE)(%rbx), %rdx
|
||||
movq %rdx, 0(%rbx,%rax,8)
|
||||
movq %rdx, 8(%rbx,%rax,8)
|
||||
```
|
||||
|
||||
The `rbx` register contains address of the `early_level4_pgt` and `%rax * 8` here is index of a page global directory occupied by the `_text` address. So here we fill two entries of the `early_level4_pgt` with address of two entries of the `early_dynamic_pgts` which is related to `_text`. The `early_dynamic_pgts` is array of arrays:
|
||||
|
||||
```C
|
||||
extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
|
||||
```
|
||||
|
||||
which will store temporary page tables for early kernel which we will not move to the `init_level4_pgt`.
|
||||
|
||||
After this we add `4096` (size of the `early_level4_pgt`) to the `rdx` (it now contains the address of the first entry of the `early_dynamic_pgts`) and put `rdi` (it now contains physical address of the `_text`) to the `rax`. Now we shift address of the `_text` ot `PUD_SHIFT` to get index of an entry from page upper directory which contains this address and clears high bits to get only `pud` related part:
|
||||
|
||||
```assembly
|
||||
addq $4096, %rdx
|
||||
movq %rdi, %rax
|
||||
shrq $PUD_SHIFT, %rax
|
||||
andl $(PTRS_PER_PUD-1), %eax
|
||||
```
|
||||
|
||||
As we have index of a page upper directory we write two addresses of the second entry of the `early_dynamic_pgts` array to the first entry of this temporary page directory:
|
||||
|
||||
```assembly
|
||||
movq %rdx, 4096(%rbx,%rax,8)
|
||||
incl %eax
|
||||
andl $(PTRS_PER_PUD-1), %eax
|
||||
movq %rdx, 4096(%rbx,%rax,8)
|
||||
```
|
||||
|
||||
In the next step we write addresses of the page middle directory entries to the `level2_kernel_pgt` and the last step is correcting of the kernel text+data virtual addresses:
|
||||
In the next step we do the same operation for last page table directory, but filling not two entries, but all entries to cover full size of the kernel.
|
||||
|
||||
After our early page table directories filled, we put physical address of the `early_level4_pgt` to the `rax` register and jump to label `1`:
|
||||
|
||||
```assembly
|
||||
leaq level2_kernel_pgt(%rip), %rdi
|
||||
leaq 4096(%rdi), %r8
|
||||
1: testq $1, 0(%rdi)
|
||||
jz 2f
|
||||
addq %rbp, 0(%rdi)
|
||||
2: addq $8, %rdi
|
||||
cmp %r8, %rdi
|
||||
jne 1b
|
||||
```
|
||||
|
||||
Here we put the address of the `level2_kernel_pgt` to the `rdi` and address of the page table entry to the `r8` register. Next we check the present bit in the `level2_kernel_pgt` and if it is zero we're moving to the next page by adding 8 bytes to `rdi` which contains address of the `level2_kernel_pgt`. After this we compare it with `r8` (contains address of the page table entry) and go back to label `1` or move forward.
|
||||
|
||||
In the next step we correct `phys_base` physical address with `rbp` (contains physical address of the `_text`), put physical address of the `early_level4_pgt` and jump to label `1`:
|
||||
|
||||
```assembly
|
||||
addq %rbp, phys_base(%rip)
|
||||
movq $(early_level4_pgt - __START_KERNEL_map), %rax
|
||||
jmp 1f
|
||||
```
|
||||
|
||||
where `phys_base` matches the first entry of the `level2_kernel_pgt` which is `512` MB kernel mapping.
|
||||
That's all for now. Our early paging is prepared and we just need to finish last preparation before we will jump into C code and kernel entry point later.
|
||||
|
||||
Last preparation before jump at the kernel entry point
|
||||
--------------------------------------------------------------------------------
|
||||
@ -343,15 +349,15 @@ movq %rax, %cr0
|
||||
We already know that to run any code, and even more [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) code from assembly, we need to setup a stack. As always, we are doing it by the setting of [stack pointer](https://en.wikipedia.org/wiki/Stack_register) to a correct place in memory and resetting [flags](https://en.wikipedia.org/wiki/FLAGS_register) register after this:
|
||||
|
||||
```assembly
|
||||
movq stack_start(%rip), %rsp
|
||||
movq initial_stack(%rip), %rsp
|
||||
pushq $0
|
||||
popfq
|
||||
```
|
||||
|
||||
The most interesting thing here is the `stack_start`. It defined in the same [source](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) code file and looks like:
|
||||
The most interesting thing here is the `initial_stack`. This symbol is defined in the [source](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) code file and looks like:
|
||||
|
||||
```assembly
|
||||
GLOBAL(stack_start)
|
||||
GLOBAL(initial_stack)
|
||||
.quad init_thread_union+THREAD_SIZE-8
|
||||
```
|
||||
|
||||
@ -372,7 +378,7 @@ The `THREAD_SIZE` macro is defined in the [arch/x86/include/asm/page_64_types.h]
|
||||
|
||||
We consider when the [kasan](http://lxr.free-electrons.com/source/Documentation/kasan.txt) is disabled and the `PAGE_SIZE` is `4096` bytes. So the `THREAD_SIZE` will expands to `16` kilobytes and represents size of the stack of a thread. Why is `thread`? You may already know that each [process](https://en.wikipedia.org/wiki/Process_%28computing%29) may have parent [processes](https://en.wikipedia.org/wiki/Parent_process) and [child](https://en.wikipedia.org/wiki/Child_process) processes. Actually, a parent process and child process differ in stack. A new kernel stack is allocated for a new process. In the Linux kernel this stack is represented by the [union](https://en.wikipedia.org/wiki/Union_type#C.2FC.2B.2B) with the `thread_info` structure.
|
||||
|
||||
And as we can see the `init_thread_union` is represented by the `thread_union`, which defined as:
|
||||
And as we can see the `init_thread_union` is represented by the `thread_union` [union](https://en.wikipedia.org/wiki/Union_type#C.2FC.2B.2B). Earlier this union looked like:
|
||||
|
||||
```C
|
||||
union thread_union {
|
||||
@ -381,46 +387,40 @@ union thread_union {
|
||||
};
|
||||
```
|
||||
|
||||
and `init_thread_union` looks like:
|
||||
but from the Linux kernel `4.9-rc1` release, `thread_info` was moved to the `task_struct` structure which represents a thread. So, for now `thread_union` looks like:
|
||||
|
||||
```C
|
||||
union thread_union init_thread_union __init_task_data =
|
||||
{ INIT_THREAD_INFO(init_task) };
|
||||
union thread_union {
|
||||
#ifndef CONFIG_THREAD_INFO_IN_TASK
|
||||
struct thread_info thread_info;
|
||||
#endif
|
||||
unsigned long stack[THREAD_SIZE/sizeof(long)];
|
||||
};
|
||||
```
|
||||
|
||||
Where the `INIT_THREAD_INFO` macro takes `task_struct` structure which represents process descriptor in the Linux kernel and does some basic initialization of the given `task_struct` structure:
|
||||
where the `CONFIG_THREAD_INFO_IN_TASK` kernel configuration option is enabled for `x86_64` architecture. So, as we consider only `x86_64` architecture in this book, an instance of `thread_union` will contain only stack and `thread_info` structure will be placed in the `task_struct`.
|
||||
|
||||
```C
|
||||
#define INIT_THREAD_INFO(tsk) \
|
||||
{ \
|
||||
.task = &tsk, \
|
||||
.flags = 0, \
|
||||
.cpu = 0, \
|
||||
.addr_limit = KERNEL_DS, \
|
||||
}
|
||||
```
|
||||
|
||||
So, the `thread_union` contains low-level information about a process and process's stack and placed in the bottom of stack:
|
||||
The `init_thread_union` looks like:
|
||||
|
||||
```
|
||||
+-----------------------+
|
||||
| |
|
||||
| |
|
||||
| |
|
||||
| Kernel stack |
|
||||
| |
|
||||
| |
|
||||
| |
|
||||
|-----------------------|
|
||||
| |
|
||||
| struct thread_info |
|
||||
| |
|
||||
+-----------------------+
|
||||
union thread_union init_thread_union __init_task_data = {
|
||||
#ifndef CONFIG_THREAD_INFO_IN_TASK
|
||||
INIT_THREAD_INFO(init_task)
|
||||
#endif
|
||||
};
|
||||
```
|
||||
|
||||
Note that we reserve `8` bytes at the to of stack. This is necessary to guarantee illegal access of the next page memory.
|
||||
which represents just thread stack. Now we may understand this expression:
|
||||
|
||||
After the early boot stack is set, to update the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) with `lgdt` instruction:
|
||||
```assembly
|
||||
GLOBAL(initial_stack)
|
||||
.quad init_thread_union+THREAD_SIZE-8
|
||||
```
|
||||
|
||||
|
||||
that `initial_stack` symbol points to the start of the `thread_union.stack` array + `THREAD_SIZE` which is 16 killobytes and - 8 bytes. Here we need to subtract `8` bytes at the to of stack. This is necessary to guarantee illegal access of the next page memory.
|
||||
|
||||
After the early boot stack is set, to update the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) with the `lgdt` instruction:
|
||||
|
||||
```assembly
|
||||
lgdt early_gdt_descr(%rip)
|
||||
@ -441,7 +441,9 @@ We need to reload `Global Descriptor Table` because now kernel works in the low
|
||||
#define GDT_ENTRIES 32
|
||||
```
|
||||
|
||||
for kernel code, data, thread local storage segments and etc... it's simple. Now let's look at the `early_gdt_descr_base`. First of `gdt_page` defined as:
|
||||
for kernel code, data, thread local storage segments and etc... it's simple. Now let's look at the definition of the `early_gdt_descr_base`.
|
||||
|
||||
First of `gdt_page` defined as:
|
||||
|
||||
```C
|
||||
struct gdt_page {
|
||||
@ -517,9 +519,8 @@ In the next step we put the address of the real mode bootparam structure to the
|
||||
|
||||
```assembly
|
||||
movq initial_code(%rip), %rax
|
||||
pushq $0
|
||||
pushq $__KERNEL_CS
|
||||
pushq %rax
|
||||
pushq $__KERNEL_CS # set correct cs
|
||||
pushq %rax # target address in negative space
|
||||
lretq
|
||||
```
|
||||
|
||||
|
@ -130,7 +130,7 @@ void set_task_stack_end_magic(struct task_struct *tsk)
|
||||
}
|
||||
```
|
||||
|
||||
Its implementation is simple. `set_task_stack_end_magic` gets the end of the stack for the given `task_struct` with the `end_of_stack` function. The end of a process stack depends on the `CONFIG_STACK_GROWSUP` configuration option. As we learn in `x86_64` architecture, the stack grows down. So the end of the process stack will be:
|
||||
Its implementation is simple. `set_task_stack_end_magic` gets the end of the stack for the given `task_struct` with the `end_of_stack` function. Earlier (and now for all architectures besides `x86_64`) stack was located in the `thread_info` structure. So the end of a process stack depends on the `CONFIG_STACK_GROWSUP` configuration option. As we learn in `x86_64` architecture, the stack grows down. So the end of the process stack will be:
|
||||
|
||||
```C
|
||||
(unsigned long *)(task_thread_info(p) + 1);
|
||||
@ -142,6 +142,45 @@ where `task_thread_info` just returns the stack which we filled with the `INIT_T
|
||||
#define task_thread_info(task) ((struct thread_info *)(task)->stack)
|
||||
```
|
||||
|
||||
From the Linux kernel `v4.9-rc1` release, `thread_info` structure may contains only flags and stack pointer resides in `task_struct` structure which represents a thread in the Linux kernel. This depends on `CONFIG_THREAD_INFO_IN_TASK` kernel configuration option which is enabled by default for `x86_64`. You can be sure in this if you will look in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) configuration build file:
|
||||
|
||||
```
|
||||
config THREAD_INFO_IN_TASK
|
||||
bool
|
||||
help
|
||||
Select this to move thread_info off the stack into task_struct. To
|
||||
make this work, an arch will need to remove all thread_info fields
|
||||
except flags and fix any runtime bugs.
|
||||
|
||||
One subtle change that will be needed is to use try_get_task_stack()
|
||||
and put_task_stack() in save_thread_stack_tsk() and get_wchan().
|
||||
```
|
||||
|
||||
and [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig):
|
||||
|
||||
```
|
||||
config X86
|
||||
def_bool y
|
||||
...
|
||||
...
|
||||
...
|
||||
select THREAD_INFO_IN_TASK
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
So, in this way we may just get end of a thread stack from the given `task_struct` structure:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_THREAD_INFO_IN_TASK
|
||||
static inline unsigned long *end_of_stack(const struct task_struct *task)
|
||||
{
|
||||
return task->stack;
|
||||
}
|
||||
#endif
|
||||
```
|
||||
|
||||
As we got the end of the init process stack, we write `STACK_END_MAGIC` there. After `canary` is set, we can check it like this:
|
||||
|
||||
```C
|
||||
|
@ -29,7 +29,9 @@ We already saw implementation of the `set_intr_gate` in the previous part about
|
||||
|
||||
* number of the interrupt;
|
||||
* base address of the interrupt/exception handler;
|
||||
* third parameter is - `Interrupt Stack Table`. `IST` is a new mechanism in the `x86_64` and part of the [TSS](http://en.wikipedia.org/wiki/Task_state_segment). Every active thread in kernel mode has own kernel stack which is 16 kilobytes. While a thread in user space, kernel stack is empty except `thread_info` (read about it previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html)) at the bottom. In addition to per-thread stacks, there are a couple of specialized stacks associated with each CPU. All about these stack you can read in the linux kernel documentation - [Kernel stacks](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks). `x86_64` provides feature which allows to switch to a new `special` stack for during any events as non-maskable interrupt and etc... And the name of this feature is - `Interrupt Stack Table`. There can be up to 7 `IST` entries per CPU and every entry points to the dedicated stack. In our case this is `DEBUG_STACK`.
|
||||
* third parameter is - `Interrupt Stack Table`. `IST` is a new mechanism in the `x86_64` and part of the [TSS](http://en.wikipedia.org/wiki/Task_state_segment). Every active thread in kernel mode has own kernel stack which is `16` kilobytes. While a thread in user space, this kernel stack is empty.
|
||||
|
||||
In addition to per-thread stacks, there are a couple of specialized stacks associated with each CPU. All about these stack you can read in the linux kernel documentation - [Kernel stacks](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks). `x86_64` provides feature which allows to switch to a new `special` stack for during any events as non-maskable interrupt and etc... And the name of this feature is - `Interrupt Stack Table`. There can be up to 7 `IST` entries per CPU and every entry points to the dedicated stack. In our case this is `DEBUG_STACK`.
|
||||
|
||||
`set_intr_gate_ist` and `set_system_intr_gate_ist` work by the same principle as `set_intr_gate` with only one difference. Both of these functions checks
|
||||
interrupt number and call `_set_gate` inside:
|
||||
|
Loading…
Reference in New Issue
Block a user