mirror of
https://github.com/0xAX/linux-insides.git
synced 2025-01-03 12:20:56 +00:00
Merge pull request #625 from tkyymmt/init-1
Update linux-initialization-1
This commit is contained in:
commit
8c6d5d5df7
@ -4,20 +4,20 @@ Kernel initialization. Part 1.
|
|||||||
First steps in the kernel code
|
First steps in the kernel code
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
The previous [post](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-6.html) was a last part of the Linux kernel [booting process](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html) chapter and now we are starting to dive into initialization process of the Linux kernel. After the image of the Linux kernel is decompressed and placed in a correct place in memory, it starts to work. All previous parts describe the work of the Linux kernel setup code which does preparation before the first bytes of the Linux kernel code will be executed. From now we are in the kernel and all parts of this chapter will be devoted to the initialization process of the kernel before it will launch process with [pid](https://en.wikipedia.org/wiki/Process_identifier) `1`. There are many things to do before the kernel will start first `init` process. Hope we will see all of the preparations before kernel will start in this big chapter. We will start from the kernel entry point, which is located in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head_64.S) and will move further and further. We will see first preparations like early page tables initialization, switch to a new descriptor in kernel space and many many more, before we will see the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c#L489) will be called.
|
The previous [post](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-6.html) was the last part of the Linux kernel [booting process](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html) chapter, and now we are starting to dive into the initialization process of the Linux kernel. After the image of the Linux kernel is decompressed and placed in the correct place in memory, it starts to work. All previous parts describe the work of the Linux kernel setup code, which does the preparation before the first bytes of the Linux kernel code are executed. From now on we are in the kernel, and all parts of this chapter will be devoted to the initialization process of the kernel before it launches the process with [pid](https://en.wikipedia.org/wiki/Process_identifier) `1`. There are many things to do before the kernel starts the first `init` process. Hopefully we will see all of these preparations in this big chapter. We will start from the kernel entry point, which is located in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S), and will move further and further. We will see the first preparations like early page table initialization, the switch to a new descriptor in kernel space and many more, before we see the `start_kernel` function from [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) being called.
|
||||||
|
|
||||||
In the last [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-6.html) of the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html) we stopped at the [jmp](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S) instruction from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S) assembly source code file:
|
In the last [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-6.html) of the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html) we stopped at the jmp instruction from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly source code file:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
jmp *%rax
|
jmp *%rax
|
||||||
```
|
```
|
||||||
|
|
||||||
At this moment the `rax` register contains address of the Linux kernel entry point which that was obtained as a result of the call of the `decompress_kernel` function from the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c) source code file. So, our last instruction in the kernel setup code is a jump on the kernel entry point. We already know where is defined the entry point of the linux kernel, so we are able to start to learn what does the Linux kernel does after the start.
|
At this moment the `rax` register contains the address of the Linux kernel entry point, which was obtained as a result of the call of the `decompress_kernel` function from the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c) source code file. So, the last instruction in the kernel setup code is a jump to the kernel entry point. We already know where the entry point of the Linux kernel is defined, so we are able to start learning what the Linux kernel does after it starts.
|
||||||
|
|
||||||
First steps in the kernel
|
First steps in the kernel
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
Okay, we got the address of the decompressed kernel image from the `decompress_kernel` function into `rax` register and just jumped there. As we already know the entry point of the decompressed kernel image starts in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head_64.S) assembly source code file and at the beginning of it, we can see following definitions:
|
Okay, we got the address of the decompressed kernel image from the `decompress_kernel` function into `rax` register and just jumped there. As we already know the entry point of the decompressed kernel image starts in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) assembly source code file and at the beginning of it, we can see following definitions:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
.text
|
.text
|
||||||
@ -36,7 +36,7 @@ We can see definition of the `startup_64` routine that is defined in the `__HEAD
|
|||||||
#define __HEAD .section ".head.text","ax"
|
#define __HEAD .section ".head.text","ax"
|
||||||
```
|
```
|
||||||
|
|
||||||
We can see definition of this section in the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/vmlinux.lds.S#L93) linker script:
|
We can see definition of this section in the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S) linker script:
|
||||||
|
|
||||||
```
|
```
|
||||||
.text : AT(ADDR(.text) - LOAD_OFFSET) {
|
.text : AT(ADDR(.text) - LOAD_OFFSET) {
|
||||||
@ -53,7 +53,7 @@ Besides the definition of the `.text` section, we can understand default virtual
|
|||||||
. = __START_KERNEL;
|
. = __START_KERNEL;
|
||||||
```
|
```
|
||||||
|
|
||||||
for the [x86_64](https://en.wikipedia.org/wiki/X86-64). The definition of the `__START_KERNEL` macro is located in the [arch/x86/include/asm/page_types.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/page_types.h) header file and represented by the sum of the base virtual address of the kernel mapping and physical start:
|
for [x86_64](https://en.wikipedia.org/wiki/X86-64). The definition of the `__START_KERNEL` macro is located in the [arch/x86/include/asm/page_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_types.h) header file and is represented by the sum of the base virtual address of the kernel mapping and the physical start:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
|
#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
|
||||||
@ -66,55 +66,70 @@ Or in other words:
|
|||||||
* Base physical address of the Linux kernel - `0x1000000`;
|
* Base physical address of the Linux kernel - `0x1000000`;
|
||||||
* Base virtual address of the Linux kernel - `0xffffffff81000000`.
|
* Base virtual address of the Linux kernel - `0xffffffff81000000`.
|
||||||
|
|
||||||
Now we know default physical and virtual addresses of the `startup_64` routine, but to know actual addresses we must to calculate it with the following code:
|
After sanitizing the CPU configuration, we call the `__startup_64` function, which is defined in [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c):
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
leaq _text(%rip), %rbp
|
leaq _text(%rip), %rdi
|
||||||
subq $_text - __START_KERNEL_map, %rbp
|
pushq %rsi
|
||||||
|
call __startup_64
|
||||||
|
popq %rsi
|
||||||
```
|
```
|
||||||
|
|
||||||
Yes, it defined as `0x1000000`, but it may be different, for example if [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization#Linux) is enabled. So our current goal is to calculate delta between `0x1000000` and where we actually loaded. Here we just put the `rip-relative` address to the `rbp` register and then subtract `$_text - __START_KERNEL_map` from it. We know that compiled virtual address of the `_text` is `0xffffffff81000000` and the physical address of it is `0x1000000`. The `__START_KERNEL_map` macro expands to the `0xffffffff80000000` address, so at the second line of the assembly code, we will get following expression:
|
```C
|
||||||
|
unsigned long __head __startup_64(unsigned long physaddr,
|
||||||
```
|
struct boot_params *bp)
|
||||||
rbp = 0x1000000 - (0xffffffff81000000 - 0xffffffff80000000)
|
{
|
||||||
|
unsigned long load_delta, *p;
|
||||||
|
unsigned long pgtable_flags;
|
||||||
|
pgdval_t *pgd;
|
||||||
|
p4dval_t *p4d;
|
||||||
|
pudval_t *pud;
|
||||||
|
pmdval_t *pmd, pmd_entry;
|
||||||
|
pteval_t *mask_ptr;
|
||||||
|
bool la57;
|
||||||
|
int i;
|
||||||
|
unsigned int *next_pgt_ptr;
|
||||||
|
...
|
||||||
|
...
|
||||||
|
...
|
||||||
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
So, after the calculation, the `rbp` will contain `0` which represents difference between addresses where we actually loaded and where the code was compiled. In our case `zero` means that the Linux kernel was loaded by default address and the [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization#Linux) was disabled.
|
Since [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization#Linux) may be enabled, the address where the `startup_64` routine was loaded may differ from the address it was compiled to run at, so we need to calculate the delta with the following code:
|
||||||
|
|
||||||
After we got the address of the `startup_64`, we need to do a check that this address is correctly aligned. We will do it with the following code:
|
```C
|
||||||
|
load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
|
||||||
```assembly
|
|
||||||
testl $~PMD_PAGE_MASK, %ebp
|
|
||||||
jnz bad_address
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Here we just compare low part of the `rbp` register with the complemented value of the `PMD_PAGE_MASK`. The `PMD_PAGE_MASK` indicates the mask for `Page middle directory` (read [Paging](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html) about it) and defined as:
|
As a result, `load_delta` contains the difference between the address where the kernel was actually loaded and the address it was compiled to run at.
|
||||||
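To make the arithmetic concrete, here is a minimal sketch of the same calculation with plain integers instead of linker symbols. The helper name `compute_load_delta` and the two constants are assumptions made for illustration; the values are the default x86_64 addresses quoted earlier in this part.

```C
#include <stdint.h>

/* Default x86_64 values quoted earlier in this part. */
#define START_KERNEL_MAP 0xffffffff80000000ULL /* __START_KERNEL_map */
#define TEXT_LINK_VADDR  0xffffffff81000000ULL /* compiled virtual address of _text */

/*
 * Hypothetical helper (not kernel code). It mirrors
 *   load_delta = physaddr - (_text - __START_KERNEL_map)
 * where physaddr is the physical address the kernel was really loaded at.
 */
static uint64_t compute_load_delta(uint64_t physaddr)
{
	/* Physical address the kernel was compiled for: 0x1000000. */
	uint64_t compiled_phys = TEXT_LINK_VADDR - START_KERNEL_MAP;

	return physaddr - compiled_phys;
}
```

With these assumed values, a kernel loaded at the default `0x1000000` gives a delta of `0`, while a kernel relocated to `0x3000000` gives a delta of `0x2000000`.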
|
|
||||||
|
After we get the delta, we check that the `_text` address is correctly aligned to `2` megabytes. We do it with the following code:
|
||||||
|
|
||||||
|
```C
|
||||||
|
if (load_delta & ~PMD_PAGE_MASK)
|
||||||
|
for (;;);
|
||||||
|
```
|
||||||
|
|
||||||
|
If the `_text` address is not aligned to `2` megabytes, we enter an infinite loop. The `PMD_PAGE_MASK` is the mask for the `Page middle directory` (read [Paging](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html) about it) and is defined as:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
#define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1))
|
#define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1))
|
||||||
```
|
```
|
||||||
|
|
||||||
where `PMD_PAGE_SIZE` macro defined as:
|
where `PMD_PAGE_SIZE` macro is defined as:
|
||||||
|
|
||||||
```
|
```C
|
||||||
#define PMD_PAGE_SIZE (_AC(1, UL) << PMD_SHIFT)
|
#define PMD_PAGE_SIZE (_AC(1, UL) << PMD_SHIFT)
|
||||||
#define PMD_SHIFT 21
|
#define PMD_SHIFT 21
|
||||||
```
|
```
|
||||||
|
|
||||||
As we can easily calculate, `PMD_PAGE_SIZE` is `2` megabytes. Here we use standard formula for checking alignment and if `text` address is not aligned for `2` megabytes, we jump to `bad_address` label.
|
As we can easily calculate, `PMD_PAGE_SIZE` is `2` megabytes.
|
||||||
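The alignment test is the usual mask trick: a value is aligned to a power-of-two size when all bits below that size are zero. The following is a small sketch under the definitions shown above; the helper `is_pmd_aligned` is a name invented for this example, and `_AC(1, UL)` is written as `1ULL` so the snippet compiles outside the kernel.

```C
#include <stdbool.h>
#include <stdint.h>

#define PMD_SHIFT     21
#define PMD_PAGE_SIZE (1ULL << PMD_SHIFT)     /* 0x200000, i.e. 2 megabytes */
#define PMD_PAGE_MASK (~(PMD_PAGE_SIZE - 1))  /* clears the low 21 bits */

/*
 * Hypothetical helper (not kernel code): true when the low 21 bits
 * are all zero, which is the condition the check above relies on.
 */
static bool is_pmd_aligned(uint64_t value)
{
	return (value & ~PMD_PAGE_MASK) == 0;
}
```

For example, `0x2000000` passes the check, while `0x2001000` does not, because it is only aligned to `4` kilobytes.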
|
|
||||||
After this we check address that it is not too large by the checking of highest `18` bits:
|
If [SME](https://en.wikipedia.org/wiki/Zen_(microarchitecture)#Enhanced_security_and_virtualization_support) is supported and enabled, we activate it and include the SME encryption mask in `load_delta`:
|
||||||
|
|
||||||
```assembly
|
|
||||||
leaq _text(%rip), %rax
|
|
||||||
shrq $MAX_PHYSMEM_BITS, %rax
|
|
||||||
jnz bad_address
|
|
||||||
```
|
|
||||||
|
|
||||||
The address must not be greater than `46`-bits:
|
|
||||||
|
|
||||||
```C
|
```C
|
||||||
#define MAX_PHYSMEM_BITS 46
|
sme_enable(bp);
|
||||||
|
load_delta += sme_get_me_mask();
|
||||||
```
|
```
|
||||||
|
|
||||||
Okay, we did some early checks and now we can move on.
|
Okay, we did some early checks and now we can move on.
|
||||||
@ -122,26 +137,34 @@ Okay, we did some early checks and now we can move on.
|
|||||||
Fix base addresses of page tables
|
Fix base addresses of page tables
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
The first step before we start to setup identity paging is to fixup following addresses:
|
In the next step we fixup the physical addresses in the page table:
|
||||||
|
|
||||||
```assembly
|
```C
|
||||||
addq %rbp, early_top_pgt + (L4_START_KERNEL*8)(%rip)
|
pgd = fixup_pointer(&early_top_pgt, physaddr);
|
||||||
addq %rbp, level3_kernel_pgt + (510*8)(%rip)
|
pud = fixup_pointer(&level3_kernel_pgt, physaddr);
|
||||||
addq %rbp, level3_kernel_pgt + (511*8)(%rip)
|
pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
|
||||||
addq %rbp, level2_fixmap_pgt + (506*8)(%rip)
|
|
||||||
```
|
```
|
||||||
|
|
||||||
All of `early_top_pgt`, `level3_kernel_pgt` and other address may be wrong if the `startup_64` is not equal to default `0x1000000` address. The `rbp` register contains the delta address so we add to the certain entries of the `early_top_pgt`, the `level3_kernel_pgt` and the `level2_fixmap_pgt`. Let's try to understand what these labels mean. First of all let's look at their definition:
|
So, let's look at the definition of the `fixup_pointer` function, which returns the physical address of the passed argument:
|
||||||
|
|
||||||
|
```C
|
||||||
|
static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
|
||||||
|
{
|
||||||
|
return ptr - (void *)_text + (void *)physaddr;
|
||||||
|
}
|
||||||
|
```
|
||||||
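The same translation can be sketched with integers, which makes it easier to follow than pointer arithmetic on linker symbols. Everything below is an illustration: `fixup_address` is a made-up name, and `TEXT_LINK_VADDR` assumes the default compiled address of `_text` mentioned earlier.

```C
#include <stdint.h>

/* Assumed default compiled virtual address of _text on x86_64. */
#define TEXT_LINK_VADDR 0xffffffff81000000ULL

/*
 * Hypothetical integer version of fixup_pointer (not kernel code):
 * take the offset of a symbol from _text and add it to the physical
 * address where _text was actually loaded.
 */
static uint64_t fixup_address(uint64_t link_vaddr, uint64_t physaddr)
{
	return link_vaddr - TEXT_LINK_VADDR + physaddr;
}
```

Under these assumptions, a symbol linked at `0xffffffff81201000` in a kernel loaded at `0x3000000` is found at physical address `0x3201000`.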
|
|
||||||
|
Next we'll focus on `early_top_pgt` and the other page table symbols which we saw above. Let's try to understand what these symbols mean. First of all let's look at their definition:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
NEXT_PAGE(early_top_pgt)
|
NEXT_PAGE(early_top_pgt)
|
||||||
.fill 511,8,0
|
.fill 512,8,0
|
||||||
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
|
.fill PTI_USER_PGD_FILL,8,0
|
||||||
|
|
||||||
NEXT_PAGE(level3_kernel_pgt)
|
NEXT_PAGE(level3_kernel_pgt)
|
||||||
.fill L3_START_KERNEL,8,0
|
.fill L3_START_KERNEL,8,0
|
||||||
.quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
|
.quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
|
||||||
.quad level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
|
.quad level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
|
||||||
|
|
||||||
NEXT_PAGE(level2_kernel_pgt)
|
NEXT_PAGE(level2_kernel_pgt)
|
||||||
PMDS(0, __PAGE_KERNEL_LARGE_EXEC,
|
PMDS(0, __PAGE_KERNEL_LARGE_EXEC,
|
||||||
@ -149,42 +172,45 @@ NEXT_PAGE(level2_kernel_pgt)
|
|||||||
|
|
||||||
NEXT_PAGE(level2_fixmap_pgt)
|
NEXT_PAGE(level2_fixmap_pgt)
|
||||||
.fill 506,8,0
|
.fill 506,8,0
|
||||||
.quad level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
|
.quad level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
|
||||||
.fill 5,8,0
|
.fill 5,8,0
|
||||||
|
|
||||||
NEXT_PAGE(level1_fixmap_pgt)
|
NEXT_PAGE(level1_fixmap_pgt)
|
||||||
.fill 512,8,0
|
.fill 512,8,0
|
||||||
```
|
```
|
||||||
|
|
||||||
Looks hard, but it isn't. First of all let's look at the `early_top_pgt`. It starts with the (4096 - 8) bytes of zeros, it means that we don't use the first `511` entries. And after this we can see one `level3_kernel_pgt` entry. Note that we subtract `__START_KERNEL_map + _PAGE_TABLE` from it. As we know `__START_KERNEL_map` is a base virtual address of the kernel text, so if we subtract `__START_KERNEL_map`, we will get physical address of the `level3_kernel_pgt`. Now let's look at `_PAGE_TABLE`, it is just page entry access rights:
|
Looks hard, but it isn't. First of all let's look at the `early_top_pgt`. It starts with `4096` bytes of zeros (or `8192` bytes if `CONFIG_PAGE_TABLE_ISOLATION` is enabled), which means that its first `512` entries are initially empty. After this we can see the `level3_kernel_pgt` definition. At the start of it, we can see that it is filled with `4080` bytes of zeros (`L3_START_KERNEL` equals `510`). After that, it stores two entries which map kernel space. Note that we subtract `__START_KERNEL_map` from `level2_kernel_pgt` and `level2_fixmap_pgt`. As we know, `__START_KERNEL_map` is the base virtual address of the kernel text, so if we subtract `__START_KERNEL_map`, we get the physical addresses of `level2_kernel_pgt` and `level2_fixmap_pgt`.
|
||||||
|
|
||||||
|
Next let's look at `_KERNPG_TABLE_NOENC` and `_PAGE_TABLE_NOENC`. These are just page entry access rights:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
|
#define _KERNPG_TABLE_NOENC (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
|
||||||
|
_PAGE_DIRTY)
|
||||||
|
#define _PAGE_TABLE_NOENC (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
|
||||||
_PAGE_ACCESSED | _PAGE_DIRTY)
|
_PAGE_ACCESSED | _PAGE_DIRTY)
|
||||||
```
|
```
|
||||||
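As a rough illustration of what these access rights look like as numbers, the sketch below combines the individual flag bits. The bit positions are the architectural x86 page table entry bits (present, read/write, user, accessed, dirty); the `EX_` prefixed names are invented here so the snippet does not clash with the real kernel macros.

```C
#include <stdint.h>

/* x86 page table entry flag bits (names prefixed to mark them as examples). */
#define EX_PAGE_PRESENT  (1ULL << 0)
#define EX_PAGE_RW       (1ULL << 1)
#define EX_PAGE_USER     (1ULL << 2)
#define EX_PAGE_ACCESSED (1ULL << 5)
#define EX_PAGE_DIRTY    (1ULL << 6)

/* Same combinations as _KERNPG_TABLE_NOENC and _PAGE_TABLE_NOENC above. */
#define EX_KERNPG_TABLE_NOENC (EX_PAGE_PRESENT | EX_PAGE_RW | \
			       EX_PAGE_ACCESSED | EX_PAGE_DIRTY)
#define EX_PAGE_TABLE_NOENC   (EX_PAGE_PRESENT | EX_PAGE_RW | EX_PAGE_USER | \
			       EX_PAGE_ACCESSED | EX_PAGE_DIRTY)

/* Hypothetical helper: build an entry from a page-aligned physical address. */
static uint64_t make_table_entry(uint64_t table_phys, uint64_t flags)
{
	return table_phys + flags;
}
```

The only difference between the two sets is the user bit, so the values are `0x63` and `0x67`. Because the table address is page aligned, adding the flags only fills the low bits of the entry.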
|
|
||||||
|
The `level2_kernel_pgt` is a page table entry which contains a pointer to the page middle directory which maps kernel space. It uses the `PMDS` macro, which creates `512` megabytes from the `__START_KERNEL_map` for kernel `.text` (after these `512` megabytes comes the module memory space).
|
||||||
|
|
||||||
|
The `level2_fixmap_pgt` holds virtual addresses which can refer to any physical addresses, even under kernel space. It is represented by `4048` bytes of zeros, the `level1_fixmap_pgt` entry, `8` megabytes reserved for [vsyscalls](https://lwn.net/Articles/446528/) mapping and `2` megabytes of hole.
|
||||||
|
|
||||||
You can read more about it in the [Paging](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html) part.
|
You can read more about it in the [Paging](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html) part.
|
||||||
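The entry numbers used in these tables follow directly from the bits of the virtual address. Here is a minimal sketch of that decomposition for 4-level paging; `table_index` is a helper name assumed for this example, while the shift values are the ones used throughout this part.

```C
#include <stdint.h>

#define PGDIR_SHIFT    39
#define PUD_SHIFT      30
#define PMD_SHIFT      21
#define PTRS_PER_TABLE 512 /* every table level holds 512 entries */

/*
 * Hypothetical helper (not kernel code): extract the 9-bit index that
 * a virtual address selects at the table level given by `shift`.
 */
static unsigned int table_index(uint64_t addr, unsigned int shift)
{
	return (unsigned int)((addr >> shift) & (PTRS_PER_TABLE - 1));
}
```

For `__START_KERNEL_map`, which is `0xffffffff80000000`, this gives index `511` in the page global directory and index `510` in the page upper directory, which is why `L3_START_KERNEL` equals `510`. The address one gigabyte higher, `0xffffffffc0000000`, selects entry `511` of the page upper directory, where `level2_fixmap_pgt` is stored.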
|
|
||||||
The `level3_kernel_pgt` - stores two entries which map kernel space. At the start of it's definition, we can see that it is filled with zeros `L3_START_KERNEL` or `510` times. Here the `L3_START_KERNEL` is the index in the page upper directory which contains `__START_KERNEL_map` address and it equals `510`. After this, we can see the definition of the two `level3_kernel_pgt` entries: `level2_kernel_pgt` and `level2_fixmap_pgt`. First is simple, it is page table entry which contains pointer to the page middle directory which maps kernel space and it has:
|
Now, after we saw the definitions of these symbols, let's get back to the code. Next we initialize last entry of `pgd` with `level3_kernel_pgt`:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
|
pgd[pgd_index(__START_KERNEL_map)] = level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC;
|
||||||
_PAGE_DIRTY)
|
|
||||||
```
|
```
|
||||||
|
|
||||||
access rights. The second - `level2_fixmap_pgt` is a virtual addresses which can refer to any physical addresses even under kernel space. They represented by the one `level2_fixmap_pgt` entry and `10` megabytes hole for the [vsyscalls](https://lwn.net/Articles/446528/) mapping. The next `level2_kernel_pgt` calls the `PDMS` macro which creates `512` megabytes from the `__START_KERNEL_map` for kernel `.text` (after these `512` megabytes will be modules memory space).
|
All of the `p*d` addresses may be wrong if `startup_64` is not loaded at the default `0x1000000` address. Remember that `load_delta` contains the delta between the address of the `startup_64` symbol which was obtained during kernel [linking](https://en.wikipedia.org/wiki/Linker_%28computing%29) and the actual address. So we add the delta to certain entries of the `p*d`:
|
||||||
|
|
||||||
Now, after we saw definitions of these symbols, let's get back to the code which is described at the beginning of the section. Remember that the `rbp` register contains delta between the address of the `startup_64` symbol which was got during kernel [linking](https://en.wikipedia.org/wiki/Linker_%28computing%29) and the actual address. So, for this moment, we just need to add this delta to the base address of some page table entries, that they'll have correct addresses. In our case these entries are:
|
```C
|
||||||
|
pgd[pgd_index(__START_KERNEL_map)] += load_delta;
|
||||||
```assembly
|
pud[510] += load_delta;
|
||||||
addq %rbp, early_top_pgt + (L4_START_KERNEL*8)(%rip)
|
pud[511] += load_delta;
|
||||||
addq %rbp, level3_kernel_pgt + (510*8)(%rip)
|
pmd[506] += load_delta;
|
||||||
addq %rbp, level3_kernel_pgt + (511*8)(%rip)
|
|
||||||
addq %rbp, level2_fixmap_pgt + (506*8)(%rip)
|
|
||||||
```
|
```
|
||||||
|
|
||||||
or the last entry of the `early_top_pgt` which is the `level3_kernel_pgt`, last two entries of the `level3_kernel_pgt` which are the `level2_kernel_pgt` and the `level2_fixmap_pgt` and five hundreds seventh entry of the `level2_fixmap_pgt` which is `level1_fixmap_pgt` page directory.
|
|
||||||
|
|
||||||
After all of this we will have:
|
After all of this we will have:
|
||||||
|
|
||||||
```
|
```
|
||||||
@ -192,80 +218,120 @@ early_top_pgt[511] -> level3_kernel_pgt[0]
|
|||||||
level3_kernel_pgt[510] -> level2_kernel_pgt[0]
|
level3_kernel_pgt[510] -> level2_kernel_pgt[0]
|
||||||
level3_kernel_pgt[511] -> level2_fixmap_pgt[0]
|
level3_kernel_pgt[511] -> level2_fixmap_pgt[0]
|
||||||
level2_kernel_pgt[0] -> 512 MB kernel mapping
|
level2_kernel_pgt[0] -> 512 MB kernel mapping
|
||||||
level2_fixmap_pgt[507] -> level1_fixmap_pgt
|
level2_fixmap_pgt[506] -> level1_fixmap_pgt
|
||||||
```
|
```
|
||||||
|
|
||||||
Note that we didn't fixup base address of the `early_top_pgt` and some of other page table directories, because we will see this during of building/filling of structures for these page tables. As we corrected base addresses of the page tables, we can start to build it.
|
Note that we didn't fix up the base address of the `early_top_pgt` and some of the other page table directories, because we will see this when building/filling the structures of these page tables. As we corrected the base addresses of the page tables, we can start to build them.
|
||||||
|
|
||||||
Identity mapping setup
|
Identity mapping setup
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
Now we can see the set up of identity mapping of early page tables. In Identity Mapped Paging, virtual addresses are mapped to physical addresses that have the same value, `1 : 1`. Let's look at it in detail. First of all we get the `rip-relative` address of the `_text` and `_early_top_pgt` and put they into `rdi` and `rbx` registers:
|
Now we can see the setup of the identity mapping of the early page tables. In Identity Mapped Paging, virtual addresses are mapped to physical addresses identically. Let's look at it in detail. First of all we replace `pud` and `pmd` with pointers to the first and second entries of `early_dynamic_pgts`:
|
||||||
|
|
||||||
```assembly
|
```C
|
||||||
leaq _text(%rip), %rdi
|
next_pgt_ptr = fixup_pointer(&next_early_pgt, physaddr);
|
||||||
leaq early_top_pgt(%rip), %rbx
|
pud = fixup_pointer(early_dynamic_pgts[(*next_pgt_ptr)++], physaddr);
|
||||||
|
pmd = fixup_pointer(early_dynamic_pgts[(*next_pgt_ptr)++], physaddr);
|
||||||
```
|
```
|
||||||
|
|
||||||
After this we store address of the `_text` in the `rax` and get the index of the page global directory entry which stores `_text` address, by shifting `_text` address on the `PGDIR_SHIFT`:
|
Let's look at the `early_dynamic_pgts` definition:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
movq %rdi, %rax
|
NEXT_PAGE(early_dynamic_pgts)
|
||||||
shrq $PGDIR_SHIFT, %rax
|
.fill 512*EARLY_DYNAMIC_PAGE_TABLES,8,0
|
||||||
```
|
```
|
||||||
|
|
||||||
where `PGDIR_SHIFT` is `39`. `PGDIR_SHFT` indicates the mask for page global directory bits in a virtual address. There are macro for all types of page directories:
|
which will store temporary page tables for early kernel.
|
||||||
|
|
||||||
|
Next we initialize `pgtable_flags` which will be used when initializing `p*d` entries later:
|
||||||
|
|
||||||
|
```C
|
||||||
|
pgtable_flags = _KERNPG_TABLE_NOENC + sme_get_me_mask();
|
||||||
|
```
|
||||||
|
|
||||||
|
`sme_get_me_mask` function returns `sme_me_mask` which was initialized in `sme_enable` function.
|
||||||
|
|
||||||
|
Next we fill two entries of `pgd` with `pud` plus `pgtable_flags` which we initialized above:
|
||||||
|
|
||||||
|
```C
|
||||||
|
i = (physaddr >> PGDIR_SHIFT) % PTRS_PER_PGD;
|
||||||
|
pgd[i + 0] = (pgdval_t)pud + pgtable_flags;
|
||||||
|
pgd[i + 1] = (pgdval_t)pud + pgtable_flags;
|
||||||
|
```
|
||||||
|
|
||||||
|
`PGDIR_SHIFT` is the position of the page global directory bits in a virtual address. Here we take the value modulo `PTRS_PER_PGD` (which expands to `512`) so that the index stays below `512`. There are macros for all types of page directories:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
#define PGDIR_SHIFT 39
|
#define PGDIR_SHIFT 39
|
||||||
|
#define PTRS_PER_PGD 512
|
||||||
#define PUD_SHIFT 30
|
#define PUD_SHIFT 30
|
||||||
|
#define PTRS_PER_PUD 512
|
||||||
#define PMD_SHIFT 21
|
#define PMD_SHIFT 21
|
||||||
|
#define PTRS_PER_PMD 512
|
||||||
```
|
```
|
||||||
|
|
||||||
After this we put the address of the first entry of the `early_dynamic_pgts` page table to the `rdx` register with the `_KERNPG_TABLE` access rights (see above) and fill the `early_top_pgt` with the 2 `early_dynamic_pgts` entries:
|
We do almost the same thing for the `pud`:
|
||||||
|
|
||||||
```assembly
|
|
||||||
leaq (4096 + _KERNPG_TABLE)(%rbx), %rdx
|
|
||||||
movq %rdx, 0(%rbx,%rax,8)
|
|
||||||
movq %rdx, 8(%rbx,%rax,8)
|
|
||||||
```
|
|
||||||
|
|
||||||
The `rbx` register contains address of the `early_top_pgt` and `%rax * 8` here is index of a page global directory occupied by the `_text` address. So here we fill two entries of the `early_top_pgt` with address of two entries of the `early_dynamic_pgts` which is related to `_text`. The `early_dynamic_pgts` is array of arrays:
|
|
||||||
|
|
||||||
```C
|
```C
|
||||||
extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
|
i = (physaddr >> PUD_SHIFT) % PTRS_PER_PUD;
|
||||||
|
pud[i + 0] = (pudval_t)pmd + pgtable_flags;
|
||||||
|
pud[i + 1] = (pudval_t)pmd + pgtable_flags;
|
||||||
```
|
```
|
||||||
|
|
||||||
which will store temporary page tables for early kernel which we will not move to the `init_level4_pgt`.
|
Next we initialize `pmd_entry` and filter out unsupported `__PAGE_KERNEL_*` bits:
|
||||||
|
|
||||||
After this we add `4096` (size of the `early_top_pgt`) to the `rdx` (it now contains the address of the first entry of the `early_dynamic_pgts`) and put `rdi` (it now contains physical address of the `_text`) to the `rax`. Now we shift address of the `_text` ot `PUD_SHIFT` to get index of an entry from page upper directory which contains this address and clears high bits to get only `pud` related part:
|
```C
|
||||||
|
pmd_entry = __PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL;
|
||||||
```assembly
|
mask_ptr = fixup_pointer(&__supported_pte_mask, physaddr);
|
||||||
addq $4096, %rdx
|
pmd_entry &= *mask_ptr;
|
||||||
movq %rdi, %rax
|
pmd_entry += sme_get_me_mask();
|
||||||
shrq $PUD_SHIFT, %rax
|
pmd_entry += physaddr;
|
||||||
andl $(PTRS_PER_PUD-1), %eax
|
|
||||||
```
|
```
|
||||||
|
|
||||||
As we have index of a page upper directory we write two addresses of the second entry of the `early_dynamic_pgts` array to the first entry of this temporary page directory:
|
Next we fill all `pmd` entries to cover full size of the kernel:
|
||||||
|
|
||||||
```assembly
|
```C
|
||||||
movq %rdx, 4096(%rbx,%rax,8)
|
for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++) {
|
||||||
incl %eax
|
int idx = i + (physaddr >> PMD_SHIFT) % PTRS_PER_PMD;
|
||||||
andl $(PTRS_PER_PUD-1), %eax
|
pmd[idx] = pmd_entry + i * PMD_SIZE;
|
||||||
movq %rdx, 4096(%rbx,%rax,8)
|
}
|
||||||
```
|
```
|
||||||
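The loop bounds above can be checked with a short sketch. The helper names `pmd_entries_needed` and `first_pmd_index` are assumptions made for this example, and `DIV_ROUND_UP` is redefined locally so the snippet compiles on its own; an image size is passed in place of `_end - _text`.

```C
#include <stdint.h>

#define PMD_SHIFT    21
#define PMD_SIZE     (1ULL << PMD_SHIFT) /* each pmd entry maps 2 megabytes */
#define PTRS_PER_PMD 512
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical helper: how many 2 megabyte entries cover the image. */
static uint64_t pmd_entries_needed(uint64_t image_bytes)
{
	return DIV_ROUND_UP(image_bytes, PMD_SIZE);
}

/* Hypothetical helper: the pmd slot where the identity mapping starts. */
static unsigned int first_pmd_index(uint64_t physaddr)
{
	return (unsigned int)((physaddr >> PMD_SHIFT) % PTRS_PER_PMD);
}
```

For instance, an image of `25` megabytes needs `13` entries, because the last partially used `2` megabyte page still has to be mapped, and a kernel loaded at the default `0x1000000` starts filling the table at index `8`.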
|
|
||||||
In the next step we do the same operation for last page table directory, but filling not two entries, but all entries to cover full size of the kernel.
|
Next we fixup the kernel text+data virtual addresses. Note that we might write invalid pmds, when the kernel is relocated (`cleanup_highmap` function fixes this up along with the mappings beyond `_end`).
|
||||||
|
|
||||||
After our early page table directories filled, we put physical address of the `early_top_pgt` to the `rax` register and jump to label `1`:
|
```C
|
||||||
|
pmd = fixup_pointer(level2_kernel_pgt, physaddr);
|
||||||
|
for (i = 0; i < PTRS_PER_PMD; i++) {
|
||||||
|
if (pmd[i] & _PAGE_PRESENT)
|
||||||
|
pmd[i] += load_delta;
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Next we remove the memory encryption mask to obtain the true physical address (remember that `load_delta` includes the mask):
|
||||||
|
|
||||||
|
```C
|
||||||
|
*fixup_long(&phys_base, physaddr) += load_delta - sme_get_me_mask();
|
||||||
|
```
|
||||||
|
|
||||||
|
`phys_base` must match the first entry in `level2_kernel_pgt`.
|
||||||
|
|
||||||
|
As the final step of the `__startup_64` function, we encrypt the kernel (if SME is active) and return the SME encryption mask to be used as a modifier for the initial page directory entry programmed into the `cr3` register:
|
||||||
|
|
||||||
|
```C
|
||||||
|
sme_encrypt_kernel(bp);
|
||||||
|
return sme_get_me_mask();
|
||||||
|
```
|
||||||
|
|
||||||
|
Now let's get back to the assembly code. We prepare for the next section with the following code:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
movq $(early_top_pgt - __START_KERNEL_map), %rax
|
addq $(early_top_pgt - __START_KERNEL_map), %rax
|
||||||
jmp 1f
|
jmp 1f
|
||||||
```
|
```
|
||||||
|
|
||||||
That's all for now. Our early paging is prepared and we just need to finish last preparation before we will jump into C code and kernel entry point later.
|
which adds the physical address of `early_top_pgt` to the `rax` register, so that `rax` contains the sum of that address and the SME encryption mask.
|
||||||
|
|
||||||
|
That's all for now. Our early paging is prepared and we just need to finish the last preparations before we jump to the kernel entry point.
|
||||||
|
|
||||||
Last preparation before jump at the kernel entry point
|
Last preparation before jump at the kernel entry point
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
@ -291,7 +357,7 @@ In the next step we check that CPU supports [NX](http://en.wikipedia.org/wiki/NX
|
|||||||
|
|
||||||
We put `0x80000001` value to the `eax` and execute `cpuid` instruction for getting the extended processor info and feature bits. The result will be in the `edx` register which we put to the `edi`.
|
We put `0x80000001` value to the `eax` and execute `cpuid` instruction for getting the extended processor info and feature bits. The result will be in the `edx` register which we put to the `edi`.
|
||||||
|
|
||||||
Now we put `0xc0000080` or `MSR_EFER` to the `ecx` and call `rdmsr` instruction for the reading model specific register.
|
Now we put `0xc0000080` or `MSR_EFER` to the `ecx` and execute the `rdmsr` instruction to read the model specific register.
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
movl $MSR_EFER, %ecx
|
movl $MSR_EFER, %ecx
|
||||||
@ -326,7 +392,14 @@ We will not see all fields in details here, but we will learn about this and oth
|
|||||||
1: wrmsr
|
1: wrmsr
|
||||||
```
|
```
|
||||||
|
|
||||||
If the [NX](https://en.wikipedia.org/wiki/NX_bit) bit is supported we enable `_EFER_NX` and write it too, with the `wrmsr` instruction. After the [NX](https://en.wikipedia.org/wiki/NX_bit) bit is set, we set some bits in the `cr0` [control register](https://en.wikipedia.org/wiki/Control_register), namely:
|
If the [NX](https://en.wikipedia.org/wiki/NX_bit) bit is supported we enable `_EFER_NX` and write it too, with the `wrmsr` instruction. After the [NX](https://en.wikipedia.org/wiki/NX_bit) bit is set, we set some bits in the `cr0` [control register](https://en.wikipedia.org/wiki/Control_register) with the following assembly code:
|
||||||
|
|
||||||
|
```assembly
|
||||||
|
movl $CR0_STATE, %eax
|
||||||
|
movq %rax, %cr0
|
||||||
|
```
|
||||||
|
|
||||||
|
specifically the following bits:
|
||||||
|
|
||||||
* `X86_CR0_PE` - system is in protected mode;
|
* `X86_CR0_PE` - system is in protected mode;
|
||||||
* `X86_CR0_MP` - controls interaction of WAIT/FWAIT instructions with TS flag in CR0;
|
* `X86_CR0_MP` - controls interaction of WAIT/FWAIT instructions with TS flag in CR0;
|
||||||
@ -336,49 +409,37 @@ If the [NX](https://en.wikipedia.org/wiki/NX_bit) bit is supported we enable `_E
|
|||||||
* `X86_CR0_AM` - alignment check enabled if AM set, AC flag (in EFLAGS register) set, and privilege level is 3;
|
* `X86_CR0_AM` - alignment check enabled if AM set, AC flag (in EFLAGS register) set, and privilege level is 3;
|
||||||
* `X86_CR0_PG` - enable paging.
|
* `X86_CR0_PG` - enable paging.
|
||||||
|
|
||||||
by the execution following assembly code:
|
|
||||||
|
|
||||||
```assembly
|
|
||||||
#define CR0_STATE (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
|
|
||||||
X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
|
|
||||||
X86_CR0_PG)
|
|
||||||
movl $CR0_STATE, %eax
|
|
||||||
movq %rax, %cr0
|
|
||||||
```
|
|
||||||
|
|
||||||
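To make the value written to `cr0` concrete, the bits listed above can be combined in a small C sketch. The bit positions follow the x86 architecture definition of `cr0`; the helper name `cr0_state` is hypothetical and used only for illustration:

```C
#include <assert.h>
#include <stdint.h>

/* Bit positions of the cr0 flags as defined by the x86 architecture. */
#define X86_CR0_PE (UINT32_C(1) << 0)   /* protection enable */
#define X86_CR0_MP (UINT32_C(1) << 1)   /* monitor coprocessor */
#define X86_CR0_ET (UINT32_C(1) << 4)   /* extension type */
#define X86_CR0_NE (UINT32_C(1) << 5)   /* numeric error */
#define X86_CR0_WP (UINT32_C(1) << 16)  /* write protect */
#define X86_CR0_AM (UINT32_C(1) << 18)  /* alignment mask */
#define X86_CR0_PG (UINT32_C(1) << 31)  /* paging */

#define CR0_STATE (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
		   X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
		   X86_CR0_PG)

static_assert(CR0_STATE == UINT32_C(0x80050033), "all seven bits combined");

/* Hypothetical helper: the value that ends up in eax before the move to cr0. */
static uint32_t cr0_state(void)
{
	return CR0_STATE;
}
```

The whole value fits in 32 bits, which is why a `movl` into `eax` is enough before the 64-bit move into `cr0`.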
We already know that to run any code, and even more [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) code from assembly, we need to setup a stack. As always, we are doing it by the setting of [stack pointer](https://en.wikipedia.org/wiki/Stack_register) to a correct place in memory and resetting [flags](https://en.wikipedia.org/wiki/FLAGS_register) register after this:
|
We already know that to run any code, and even more [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) code from assembly, we need to setup a stack. As always, we are doing it by the setting of [stack pointer](https://en.wikipedia.org/wiki/Stack_register) to a correct place in memory and resetting [flags](https://en.wikipedia.org/wiki/FLAGS_register) register after this:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
movq initial_stack(%rip), %rsp
|
movq initial_stack(%rip), %rsp
|
||||||
pushq $0
|
pushq $0
|
||||||
popfq
|
popfq
|
||||||
```
|
```
|
||||||
|
|
||||||
The most interesting thing here is the `initial_stack`. This symbol is defined in the [source](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head_64.S) code file and looks like:
|
The most interesting thing here is the `initial_stack`. This symbol is defined in the [source](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) code file and looks like:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
GLOBAL(initial_stack)
|
GLOBAL(initial_stack)
|
||||||
.quad init_thread_union + THREAD_SIZE - SIZEOF_PTREGS
|
.quad init_thread_union + THREAD_SIZE - SIZEOF_PTREGS
|
||||||
```
|
```
|
||||||
|
|
||||||
The `GLOBAL` is already familiar to us from. It defined in the [arch/x86/include/asm/linkage.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/linkage.h) header file expands to the `global` symbol definition:
|
The `THREAD_SIZE` macro is defined in the [arch/x86/include/asm/page_64_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/page_64_types.h) header file and depends on value of the `KASAN_STACK_ORDER` macro:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
#define GLOBAL(name) \
|
#ifdef CONFIG_KASAN
|
||||||
.globl name; \
|
#define KASAN_STACK_ORDER 1
|
||||||
name:
|
#else
|
||||||
```
|
#define KASAN_STACK_ORDER 0
|
||||||
|
#endif
|
||||||
|
|
||||||
The `THREAD_SIZE` macro is defined in the [arch/x86/include/asm/page_64_types.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/page_64_types.h) header file and depends on value of the `KASAN_STACK_ORDER` macro:
|
|
||||||
|
|
||||||
```C
|
|
||||||
#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
|
#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
|
||||||
#define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)
|
#define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)
|
||||||
```
|
```
|
||||||
|
|
||||||
We consider when the [kasan](http://lxr.free-electrons.com/source/Documentation/kasan.txt) is disabled and the `PAGE_SIZE` is `4096` bytes. So the `THREAD_SIZE` will expands to `16` kilobytes and represents size of the stack of a thread. Why is `thread`? You may already know that each [process](https://en.wikipedia.org/wiki/Process_%28computing%29) may have parent [processes](https://en.wikipedia.org/wiki/Parent_process) and [child](https://en.wikipedia.org/wiki/Child_process) processes. Actually, a parent process and child process differ in stack. A new kernel stack is allocated for a new process. In the Linux kernel this stack is represented by the [union](https://en.wikipedia.org/wiki/Union_type#C.2FC.2B.2B) with the `thread_info` structure.
|
We consider the case when [kasan](https://github.com/torvalds/linux/blob/master/Documentation/dev-tools/kasan.rst) is disabled and the `PAGE_SIZE` is `4096` bytes. So the `THREAD_SIZE` will expand to `16` kilobytes and represents the size of the stack of a thread. Why `thread`? You may already know that each [process](https://en.wikipedia.org/wiki/Process_%28computing%29) may have [parent processes](https://en.wikipedia.org/wiki/Parent_process) and [child processes](https://en.wikipedia.org/wiki/Child_process). Actually, a parent process and a child process differ in stack. A new kernel stack is allocated for a new process. In the Linux kernel this stack is represented by the [union](https://en.wikipedia.org/wiki/Union_type#C.2FC.2B.2B) with the `thread_info` structure.
|
||||||
|
|
||||||
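The arithmetic behind the `16` kilobytes can be sketched in C. The macros mirror the definitions quoted above; `PAGE_SHIFT` of `12` is an assumption matching a `PAGE_SIZE` of `4096` bytes, and the helper `thread_size` is hypothetical:

```C
#include <assert.h>

#define PAGE_SHIFT         12                 /* assumed: 4096-byte pages */
#define PAGE_SIZE          (1UL << PAGE_SHIFT)
#define KASAN_STACK_ORDER  0                  /* kasan disabled */
#define THREAD_SIZE_ORDER  (2 + KASAN_STACK_ORDER)
#define THREAD_SIZE        (PAGE_SIZE << THREAD_SIZE_ORDER)

static_assert(THREAD_SIZE == 16384UL, "four pages, 16 kilobytes");

/* Hypothetical helper: stack size for a given KASAN_STACK_ORDER value. */
static unsigned long thread_size(unsigned int kasan_stack_order)
{
	return PAGE_SIZE << (2 + kasan_stack_order);
}
```

With kasan enabled (`KASAN_STACK_ORDER` equal to `1`) the same expression doubles the stack to `32` kilobytes.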
The `init_thread_union` is represented by the `thread_union`. And the `thread_union` is defined in the [include/linux/sched.h](https://github.com/torvalds/linux/blob/0500871f21b237b2bea2d9db405eadf78e5aab05/include/linux/sched.h) file like the following:
|
The `init_thread_union` is represented by the `thread_union`. And the `thread_union` is defined in the [include/linux/sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h) file like the following:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
union thread_union {
|
union thread_union {
|
||||||
@ -392,28 +453,30 @@ union thread_union {
|
|||||||
};
|
};
|
||||||
```
|
```
|
||||||
|
|
||||||
The `CONFIG_ARCH_TASK_STRUCT_ON_STACK` kernel configuration option is only enabled for `ia64` architecture, and the `CONFIG_THREAD_INFO_IN_TASK` kernel configuration option is enabled for `x86_64` architecture. Thus the `thread_info` structure will be placed in `task_struct` structure instead of the `thread_union` union. So, as we consider only `x86_64` architecture in this book, an instance of `thread_union` will contain only stack and task.
|
The `CONFIG_ARCH_TASK_STRUCT_ON_STACK` kernel configuration option is only enabled for `ia64` architecture, and the `CONFIG_THREAD_INFO_IN_TASK` kernel configuration option is enabled for `x86_64` architecture. Thus the `thread_info` structure will be placed in `task_struct` structure instead of the `thread_union` union.
|
||||||
|
|
||||||
The `init_thread_union` is defined in the [include/asm-generic/vmlinux.lds.h](https://github.com/torvalds/blob/a6214385005333202c8cc1744c7075a9e1a26b9a/include/asm-generic/vmlinux.lds.h) file as part of the `INIT_TASK_DATA` macro like the following:
|
The `init_thread_union` is placed in the [include/asm-generic/vmlinux.lds.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/vmlinux.lds.h) file as part of the `INIT_TASK_DATA` macro like the following:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
#define INIT_TASK_DATA(align) \
|
#define INIT_TASK_DATA(align) \
|
||||||
|
. = ALIGN(align); \
|
||||||
... \
|
... \
|
||||||
init_thread_union = .; \
|
init_thread_union = .; \
|
||||||
...
|
...
|
||||||
```
|
```
|
||||||
|
|
||||||
This macro is used in the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/c62e43202e7cf50ca24bce58b255df7bf5de69d0/arch/x86/kernel/vmlinux.lds.S) file like the following:
|
This macro is used in the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S) file like the following:
|
||||||
|
|
||||||
```
|
```
|
||||||
.data : AT(ADDR(.data) - LOAD_OFFSET) {
|
.data : AT(ADDR(.data) - LOAD_OFFSET) {
|
||||||
...
|
...
|
||||||
/* init_task */
|
|
||||||
INIT_TASK_DATA(THREAD_SIZE)
|
INIT_TASK_DATA(THREAD_SIZE)
|
||||||
...
|
...
|
||||||
} :data
|
} :data
|
||||||
```
|
```
|
||||||
|
|
||||||
|
That is, `init_thread_union` is initialized with the address which is aligned to `THREAD_SIZE` which is `16` kilobytes.
|
||||||
|
|
||||||
Now we may understand this expression:
|
Now we may understand this expression:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
@ -421,7 +484,7 @@ GLOBAL(initial_stack)
|
|||||||
.quad init_thread_union + THREAD_SIZE - SIZEOF_PTREGS
|
.quad init_thread_union + THREAD_SIZE - SIZEOF_PTREGS
|
||||||
```
|
```
|
||||||
|
|
||||||
that `initial_stack` symbol points to the start of the `thread_union.stack` array + `THREAD_SIZE` which is 16 killobytes and - `SIZEOF_PTREGS` which is 168 bytes. Here we need to subtract `168` bytes at the top of stack. This is necessary to guarantee illegal access of the next page memory.
|
that is, the `initial_stack` symbol points to the start of the `thread_union.stack` array plus `THREAD_SIZE` (16 kilobytes) minus `SIZEOF_PTREGS`. Leaving room for a `pt_regs` structure at the top of the stack is a convention which helps the in-kernel unwinder reliably detect the end of the stack.
|
||||||
|
|
||||||
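The expression can be reproduced in a small C sketch. It assumes a `THREAD_SIZE` of `16` kilobytes and a `SIZEOF_PTREGS` of `168` bytes (21 saved registers of 8 bytes each on `x86_64`); the `thread_union_sketch` type and the `initial_stack_of` helper are hypothetical stand-ins, not the kernel's definitions:

```C
#include <assert.h>
#include <stdint.h>

#define THREAD_SIZE    16384UL  /* assumed: kasan disabled, 4096-byte pages */
#define SIZEOF_PTREGS  168UL    /* assumed: 21 registers * 8 bytes */

/* Simplified stand-in for thread_union: only the stack member is modelled. */
union thread_union_sketch {
	unsigned char stack[THREAD_SIZE];
};

/* Hypothetical helper: address that initial_stack would hold for union u. */
static uintptr_t initial_stack_of(const union thread_union_sketch *u)
{
	return (uintptr_t)u + THREAD_SIZE - SIZEOF_PTREGS;
}
```

The resulting stack pointer sits `16216` bytes above the start of the union, leaving the top `168` bytes free.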
After the early boot stack is set, we update the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) with the `lgdt` instruction:
|
After the early boot stack is set, we update the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) with the `lgdt` instruction:
|
||||||
|
|
||||||
@ -438,15 +501,11 @@ early_gdt_descr_base:
|
|||||||
.quad INIT_PER_CPU_VAR(gdt_page)
|
.quad INIT_PER_CPU_VAR(gdt_page)
|
||||||
```
|
```
|
||||||
|
|
||||||
We need to reload `Global Descriptor Table` because now kernel works in the low userspace addresses, but soon kernel will work in its own space. Now let's look at the definition of `early_gdt_descr`. Global Descriptor Table contains `32` entries:
|
We need to reload the `Global Descriptor Table` because the kernel currently works in the low userspace addresses, but soon it will work in its own space.
|
||||||
|
|
||||||
```C
|
Now let's look at the definition of `early_gdt_descr`. `GDT_ENTRIES` expands to `32`, so the Global Descriptor Table contains `32` entries for kernel code, data, thread local storage segments and so on.
|
||||||
#define GDT_ENTRIES 32
|
|
||||||
```
|
|
||||||
|
|
||||||
for kernel code, data, thread local storage segments and etc... it's simple. Now let's look at the definition of the `early_gdt_descr_base`.
|
Now let's look at the definition of `early_gdt_descr_base`. The `gdt_page` structure is defined in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) as:
|
||||||
|
|
||||||
First of `gdt_page` defined as:
|
|
||||||
|
|
||||||
```C
|
```C
|
||||||
struct gdt_page {
|
struct gdt_page {
|
||||||
@ -454,7 +513,7 @@ struct gdt_page {
|
|||||||
} __attribute__((aligned(PAGE_SIZE)));
|
} __attribute__((aligned(PAGE_SIZE)));
|
||||||
```
|
```
|
||||||
|
|
||||||
in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/desc.h). It contains one field `gdt` which is array of the `desc_struct` structure which is defined as:
|
It contains one field `gdt` which is array of the `desc_struct` structure which is defined as:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
struct desc_struct {
|
struct desc_struct {
|
||||||
@ -473,13 +532,15 @@ struct desc_struct {
|
|||||||
} __attribute__((packed));
|
} __attribute__((packed));
|
||||||
```
|
```
|
||||||
|
|
||||||
and presents familiar to us `GDT` descriptor. Also we can note that `gdt_page` structure aligned to `PAGE_SIZE` which is `4096` bytes. It means that `gdt` will occupy one page. Now let's try to understand what is `INIT_PER_CPU_VAR`. `INIT_PER_CPU_VAR` is a macro which defined in the [arch/x86/include/asm/percpu.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/percpu.h) and just concats `init_per_cpu__` with the given parameter:
|
which looks like the familiar `GDT` descriptor. Note that the `gdt_page` structure is aligned to `PAGE_SIZE`, which is `4096` bytes. This means that `gdt` will occupy one page.
|
||||||
|
|
||||||
|
Now let's try to understand what `INIT_PER_CPU_VAR` is. `INIT_PER_CPU_VAR` is a macro which is defined in the [arch/x86/include/asm/percpu.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/percpu.h) and just concatenates `init_per_cpu__` with the given parameter:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
#define INIT_PER_CPU_VAR(var) init_per_cpu__##var
|
#define INIT_PER_CPU_VAR(var) init_per_cpu__##var
|
||||||
```
|
```
|
||||||
|
|
||||||
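The `##` operator pastes tokens at preprocessing time, so the macro produces a new identifier rather than a string. A minimal sketch of this, where the variable `init_per_cpu__gdt_page` is a plain stand-in for the symbol that the kernel build provides and `gdt_page_symbol` is a hypothetical helper:

```C
#include <assert.h>

#define INIT_PER_CPU_VAR(var)  init_per_cpu__##var

/* Stand-in for the real symbol; the value 42 is arbitrary. */
static int init_per_cpu__gdt_page = 42;

/* Hypothetical helper: INIT_PER_CPU_VAR(gdt_page) names the variable above. */
static int *gdt_page_symbol(void)
{
	return &INIT_PER_CPU_VAR(gdt_page);
}
```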
After the `INIT_PER_CPU_VAR` macro will be expanded, we will have `init_per_cpu__gdt_page`. We can see in the [linker script](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/vmlinux.lds.S):
|
After the `INIT_PER_CPU_VAR` macro is expanded, we will have `init_per_cpu__gdt_page`. We can see the initialization of `init_per_cpu__gdt_page` in the [linker script](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S):
|
||||||
|
|
||||||
```
|
```
|
||||||
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
|
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
|
||||||
@ -488,7 +549,7 @@ INIT_PER_CPU(gdt_page);
|
|||||||
|
|
||||||
Since `INIT_PER_CPU_VAR` gives us `init_per_cpu__gdt_page`, and the `INIT_PER_CPU` macro from the linker script defines that symbol as an offset from `__per_cpu_load`, after these calculations we have the correct base address of the new GDT.
|
Since `INIT_PER_CPU_VAR` gives us `init_per_cpu__gdt_page`, and the `INIT_PER_CPU` macro from the linker script defines that symbol as an offset from `__per_cpu_load`, after these calculations we have the correct base address of the new GDT.
|
||||||
|
|
||||||
Generally per-CPU variables is a 2.6 kernel feature. You can understand what it is from its name. When we create `per-CPU` variable, each CPU will have its own copy of this variable. Here we creating `gdt_page` per-CPU variable. There are many advantages for variables of this type, like there are no locks, because each CPU works with its own copy of variable and etc... So every core on multiprocessor will have its own `GDT` table and every entry in the table will represent a memory segment which can be accessed from the thread which ran on the core. You can read in details about `per-CPU` variables in the [Theory/per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) post.
|
Generally per-CPU variables are a 2.6 kernel feature. You can understand what it is from its name. When we create a `per-CPU` variable, each CPU will have its own copy of this variable. Here we are creating the `gdt_page` per-CPU variable. There are many advantages to variables of this type, for example there are no locks, because each CPU works with its own copy of the variable. So every core on a multiprocessor will have its own `GDT` table and every entry in the table will represent a memory segment which can be accessed from the thread which runs on the core. You can read in detail about `per-CPU` variables in the [Concepts/per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) post.
|
||||||
|
|
||||||
As we loaded new Global Descriptor Table, we reload segments as we did it every time:
|
As we loaded new Global Descriptor Table, we reload segments as we did it every time:
|
||||||
|
|
||||||
@ -516,18 +577,21 @@ where `MSR_GS_BASE` is:
|
|||||||
#define MSR_GS_BASE 0xc0000101
|
#define MSR_GS_BASE 0xc0000101
|
||||||
```
|
```
|
||||||
|
|
||||||
We need to put `MSR_GS_BASE` to the `ecx` register and load data from the `eax` and `edx` (which are point to the `initial_gs`) with `wrmsr` instruction. We don't use `cs`, `fs`, `ds` and `ss` segment registers for addressing in the 64-bit mode, but `fs` and `gs` registers can be used. `fs` and `gs` have a hidden part (as we saw it in the real mode for `cs`) and this part contains descriptor which mapped to [Model Specific Registers](https://en.wikipedia.org/wiki/Model-specific_register). So we can see above `0xc0000101` is a `gs.base` MSR address. When a [system call](https://en.wikipedia.org/wiki/System_call) or [interrupt](https://en.wikipedia.org/wiki/Interrupt) occurred, there is no kernel stack at the entry point, so the value of the `MSR_GS_BASE` will store address of the interrupt stack.
|
We need to put `MSR_GS_BASE` to the `ecx` register and load data from the `eax` and `edx` (which point to the `initial_gs`) with `wrmsr` instruction. We don't use `cs`, `fs`, `ds` and `ss` segment registers for addressing in the 64-bit mode, but `fs` and `gs` registers can be used. `fs` and `gs` have a hidden part (as we saw it in the real mode for `cs`) and this part contains a descriptor which is mapped to [Model Specific Registers](https://en.wikipedia.org/wiki/Model-specific_register). So we can see above `0xc0000101` is a `gs.base` MSR address. When a [system call](https://en.wikipedia.org/wiki/System_call) or [interrupt](https://en.wikipedia.org/wiki/Interrupt) occurs, there is no kernel stack at the entry point, so the value of the `MSR_GS_BASE` will store address of the interrupt stack.
|
||||||
|
|
||||||
In the next step we put the address of the real mode bootparam structure to the `rdi` (remember `rsi` holds pointer to this structure from the start) and jump to the C code with:
|
In the next step we put the address of the real mode bootparam structure to the `rdi` (remember `rsi` holds pointer to this structure from the start) and jump to the C code with:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
|
pushq $.Lafter_lret # put return address on stack for unwinder
|
||||||
|
xorq %rbp, %rbp # clear frame pointer
|
||||||
movq initial_code(%rip), %rax
|
movq initial_code(%rip), %rax
|
||||||
pushq $__KERNEL_CS # set correct cs
|
pushq $__KERNEL_CS # set correct cs
|
||||||
pushq %rax # target address in negative space
|
pushq %rax # target address in negative space
|
||||||
lretq
|
lretq
|
||||||
|
.Lafter_lret:
|
||||||
```
|
```
|
||||||
|
|
||||||
Here we put the address of the `initial_code` to the `rax` and push fake address, `__KERNEL_CS` and the address of the `initial_code` to the stack. After this we can see `lretq` instruction which means that after it return address will be extracted from stack (now there is address of the `initial_code`) and jump there. `initial_code` is defined in the same source code file and looks:
|
Here we put the address of the `initial_code` to the `rax` and push the return address, `__KERNEL_CS` and the address of the `initial_code` to the stack. After this we can see the `lretq` instruction, which pops the target address (now the address of the `initial_code`) and the code segment from the stack and jumps there. `initial_code` is defined in the same source code file and looks like this:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
.balign 8
|
.balign 8
|
||||||
@ -538,10 +602,11 @@ Here we put the address of the `initial_code` to the `rax` and push fake address
|
|||||||
...
|
...
|
||||||
```
|
```
|
||||||
|
|
||||||
As we can see `initial_code` contains address of the `x86_64_start_kernel`, which is defined in the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head64.c) and looks like this:
|
As we can see `initial_code` contains address of the `x86_64_start_kernel`, which is defined in the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c) and looks like this:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) {
|
asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
|
||||||
|
{
|
||||||
...
|
...
|
||||||
...
|
...
|
||||||
...
|
...
|
||||||
@ -550,12 +615,10 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) {
|
|||||||
|
|
||||||
It has one argument, `real_mode_data` (remember that we passed the address of the real mode data to the `rdi` register previously).
|
It has one argument, `real_mode_data` (remember that we passed the address of the real mode data to the `rdi` register previously).
|
||||||
|
|
||||||
This is first C code in the kernel!
|
|
||||||
|
|
||||||
Next to start_kernel
|
Next to start_kernel
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
We need to see last preparations before we can see "kernel entry point" - start_kernel function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c#L489).
|
We need to see last preparations before we can see "kernel entry point" - start_kernel function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c).
|
||||||
|
|
||||||
First of all we can see some checks in the `x86_64_start_kernel` function:
|
First of all we can see some checks in the `x86_64_start_kernel` function:
|
||||||
|
|
||||||
@ -566,11 +629,11 @@ BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
|
|||||||
BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
|
BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
|
||||||
BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
|
BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
|
||||||
BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
|
BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
|
||||||
BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) == (__START_KERNEL & PGDIR_MASK)));
|
MAYBE_BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) == (__START_KERNEL & PGDIR_MASK)));
|
||||||
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
|
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
|
||||||
```
|
```
|
||||||
|
|
||||||
There are checks for different things like virtual addresses of modules space is not fewer than base address of the kernel text - `__START_KERNEL_map`, that kernel text with modules is not less than image of the kernel and etc... `BUILD_BUG_ON` is a macro which looks as:
|
There are checks for different things, for example that the virtual address of the module space is not lower than the base address of the kernel text - `__START_KERNEL_map`, that the kernel text with modules is not smaller than the image of the kernel, and so on. `BUILD_BUG_ON` is a macro which looks like this:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
|
#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
|
||||||
@ -586,21 +649,24 @@ That's all. So interesting C trick for getting compile error which depends on so
|
|||||||
In the next step we can see call of the `cr4_init_shadow` function which stores shadow copy of the `cr4` per cpu. Context switches can change bits in the `cr4` so we need to store `cr4` for each CPU. And after this we can see call of the `reset_early_page_tables` function where we reset all page global directory entries and write a new pointer to the PGT in `cr3`:
|
In the next step we can see call of the `cr4_init_shadow` function which stores shadow copy of the `cr4` per cpu. Context switches can change bits in the `cr4` so we need to store `cr4` for each CPU. And after this we can see call of the `reset_early_page_tables` function where we reset all page global directory entries and write a new pointer to the PGT in `cr3`:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
for (i = 0; i < PTRS_PER_PGD-1; i++)
|
memset(early_top_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
|
||||||
early_top_pgt[i].pgd = 0;
|
next_early_pgt = 0;
|
||||||
|
write_cr3(__sme_pa_nodebug(early_top_pgt));
|
||||||
next_early_pgt = 0;
|
|
||||||
|
|
||||||
write_cr3(__pa_nodebug(early_top_pgt));
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Soon we will build new page tables. Here we can see that we go through all Page Global Directory Entries (`PTRS_PER_PGD` is `512`) in the loop and make it zero. After this we set `next_early_pgt` to zero (we will see details about it in the next post) and write physical address of the `early_top_pgt` to the `cr3`. `__pa_nodebug` is a macro which will be expanded to:
|
Soon we will build new page tables. Here we can see that we zero all Page Global Directory entries. After this we set `next_early_pgt` to zero (we will see details about it in the next post) and write physical address of the `early_top_pgt` to the `cr3`.
|
||||||
|
|
||||||
```C
|
After this we clear `_bss` from `__bss_start` to `__bss_stop` and also clear `init_top_pgt`. `init_top_pgt` is defined in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) like the following:
|
||||||
((unsigned long)(x) - __START_KERNEL_map + phys_base)
|
|
||||||
|
```assembly
|
||||||
|
NEXT_PGD_PAGE(init_top_pgt)
|
||||||
|
.fill 512,8,0
|
||||||
|
.fill PTI_USER_PGD_FILL,8,0
|
||||||
```
|
```
|
||||||
|
|
||||||
After this we clear `_bss` from the `__bss_stop` to `__bss_start` and the next step will be setup of the early `IDT` handlers, but it's big concept so we will see it in the next part.
|
This is exactly the same definition as `early_top_pgt`.
|
||||||
|
|
||||||
|
The next step will be setup of the early `IDT` handlers, but it's big concept so we will see it in the next post.
|
||||||
|
|
||||||
Conclusion
|
Conclusion
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|