This is the third part of the Linux kernel initialization process series. In the previous [part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-2.md) we saw early interrupt and exception handling, and in this part we will continue diving into the Linux kernel initialization process. Our next stop is the 'kernel entry point' - the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) source code file. Technically it is not the kernel's entry point, but the start of the generic kernel code which does not depend on a particular architecture. But before we call the `start_kernel` function, we must do some preparations. So let's continue.
In the previous part we stopped at setting up the Interrupt Descriptor Table and loading it into the `IDTR` register. The next step after this is a call to the `copy_bootdata` function, `copy_bootdata(__va(real_mode_data))`.
This function takes one argument - the virtual address of `real_mode_data`. Remember that we passed the address of the `boot_params` structure from [arch/x86/include/uapi/asm/bootparam.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/uapi/asm/bootparam.h#L114) to the `x86_64_start_kernel` function as the first argument in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head_64.S).
Now let's look at the `__va` macro. This macro is defined in [arch/x86/include/asm/page.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/page.h) and simply adds `PAGE_OFFSET` to the given physical address, where `PAGE_OFFSET` is `__PAGE_OFFSET`, which is `0xffff880000000000`, the base virtual address of the direct mapping of all physical memory. So we get the virtual address of the `boot_params` structure that comes along from real mode and pass it to the `copy_bootdata` function, where we copy `real_mode_data` to the `boot_params` defined in [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/d9919d43cbf6790d2bc0c0a2743c51fc25f26919/arch/x86/kernel/setup.c).
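The address arithmetic behind `__va` can be tried out in userspace. The following is only a sketch: the names `sketch_va`, `sketch_pa` and `SKETCH_PAGE_OFFSET` are hypothetical, and the functions return plain integers instead of pointers because the resulting addresses are not valid in a user process.

```c
#include <stdint.h>

/* Value taken from the text above: base of the direct mapping. */
#define SKETCH_PAGE_OFFSET 0xffff880000000000ULL

/* Hypothetical userspace analogue of __va(): only the arithmetic,
 * the result is an integer and must not be dereferenced. */
static uint64_t sketch_va(uint64_t phys)
{
    return phys + SKETCH_PAGE_OFFSET;
}

/* Inverse direction for addresses inside the direct mapping
 * (an assumption of this sketch, not the kernel's full __pa logic). */
static uint64_t sketch_pa(uint64_t virt)
{
    return virt - SKETCH_PAGE_OFFSET;
}
```

For example, under these assumptions the physical address `0x1000` maps to `0xffff880000001000`.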
First of all, note that this function is declared with the `__init` prefix. It means that this function will be used only during initialization and the memory it occupies will be freed afterwards.
We can see declarations of two variables for the kernel command line and the copying of `real_mode_data` to `boot_params` using the `memcpy` function. The next call, to the `sanitize_boot_params` function, cleans up some fields of the `boot_params` structure such as `ext_ramdisk_image`: if the bootloader failed to initialize fields unknown to it, they are set to zero here. After this we get the address of the command line with a call to the `get_cmd_line_ptr` function, which gets the 64-bit address of the command line from the kernel boot header and returns it. In the last step we check `cmd_line_ptr`, get its virtual address and copy the command line to `boot_command_line`, which is just an array of bytes.
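To make these two steps more concrete, here is a minimal userspace sketch. It assumes that the 64-bit address is assembled from a low and a high 32-bit field; the structure `sketch_boot_params`, the buffer size and all `sketch_` names are hypothetical stand-ins, and the copy uses a bounded string copy rather than whatever the kernel does internally.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_COMMAND_LINE_SIZE 2048  /* assumed buffer size for this sketch */

/* Hypothetical, trimmed-down stand-in for the relevant boot fields. */
struct sketch_boot_params {
    uint32_t cmd_line_ptr;      /* low 32 bits of the command line address */
    uint32_t ext_cmd_line_ptr;  /* high 32 bits of the command line address */
};

/* Combine the two 32-bit halves into one 64-bit address. */
static uint64_t sketch_get_cmd_line_ptr(const struct sketch_boot_params *bp)
{
    uint64_t ptr = bp->cmd_line_ptr;

    ptr |= (uint64_t)bp->ext_cmd_line_ptr << 32;
    return ptr;
}

/* The destination is just an array of bytes. */
static char sketch_boot_command_line[SKETCH_COMMAND_LINE_SIZE];

/* Copy a NUL-terminated command line, truncating it if it is too long. */
static void sketch_copy_cmd_line(const char *command_line)
{
    strncpy(sketch_boot_command_line, command_line,
            SKETCH_COMMAND_LINE_SIZE - 1);
    sketch_boot_command_line[SKETCH_COMMAND_LINE_SIZE - 1] = '\0';
}
```

In the kernel the source address would first be converted with `__va`; the sketch skips that step because it works on ordinary userspace strings.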
After this we have copied the kernel command line and the `boot_params` structure. In the next step we can see a call to the `load_ucode_bsp` function, which loads processor microcode, but we will not look at it here.
After the microcode has been loaded we can see a check of `console_loglevel` and a call to the `early_printk` function, which prints the `Kernel Alive` string. But you'll never see this output because `early_printk` is not initialized yet. It is a minor bug in the kernel and I sent a patch - [commit](http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=91d8f0416f3989e248d3a3d3efb821eda10a85d2) - and you will see it in the mainline soon. So you can skip this code.
In the next step, as we have copied the `boot_params` structure, we need to move from the early page tables to the page tables for the initialization process. We already set up early page tables for the switchover (you can read about it in the previous [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-1)), dropped them all in the `reset_early_page_tables` function (you can read about it in the previous part as well) and kept only the kernel high mapping. After this we call the `clear_page` function and pass it `init_level4_pgt`, which is also defined in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head_64.S) and which maps the first 2 gigabytes and 512 megabytes for the kernel code, data and bss.
The `clear_page` function is defined in [arch/x86/lib/clear_page_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/lib/clear_page_64.S). Let's look at this function.
As you can understand from the function name, it clears the page tables, that is, fills them with zeros. First of all, note that this function starts with `CFI_STARTPROC` and ends with `CFI_ENDPROC`, which expand to the GNU assembler directives `.cfi_startproc` and `.cfi_endproc` and are used for debugging.
After the `CFI_STARTPROC` macro we zero out the `eax` register and load 64 into `ecx` (it will be a counter). Next, we can see a loop that starts at the `.Lloop` label and begins by decrementing the `ecx` counter. After that we store zero from the `rax` register at the address held in `rdi`, which now contains the base address of `init_level4_pgt`, and repeat the store seven more times, each time with the offset from `rdi` increased by 8. After this, the first 64 bytes of `init_level4_pgt` are filled with zeros. In the next step we advance `rdi` by 64 bytes and repeat all operations until `ecx` reaches zero. In the end `init_level4_pgt` is filled with zeros: 64 iterations of 64 bytes cover the whole 4096-byte page.
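The loop described above can be mirrored in plain C. This is a sketch rather than the kernel's assembly: `sketch_clear_page` and `SKETCH_PAGE_SIZE` are hypothetical names, and the local variables only play the roles of the registers mentioned in the text.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096

/* 64 iterations of eight 8-byte stores must cover exactly one page. */
_Static_assert(64 * 8 * sizeof(uint64_t) == SKETCH_PAGE_SIZE,
               "loop shape must cover one page");

/* Zero one page: each iteration stores eight zero quadwords (64 bytes)
 * and then advances the destination by 64 bytes. */
static void sketch_clear_page(void *page)
{
    uint64_t *p = page;        /* plays the role of rdi */
    const uint64_t zero = 0;   /* plays the role of rax */
    unsigned int count = 64;   /* plays the role of ecx */

    do {
        p[0] = zero;
        p[1] = zero;
        p[2] = zero;
        p[3] = zero;
        p[4] = zero;
        p[5] = zero;
        p[6] = zero;
        p[7] = zero;
        p += 8;                /* 8 quadwords == 64 bytes */
    } while (--count != 0);
}
```

Unrolling eight stores per iteration keeps the loop overhead small, which is presumably why the assembly version is written this way.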
At the end of `x86_64_start_kernel` we call the `x86_64_start_reservations` function with `real_mode_data` as its argument. The `x86_64_start_reservations` function is defined in the same source code file as the `x86_64_start_kernel` function.
In the next step we can see a call to the `reserve_ebda_region` function, which is defined in [arch/x86/kernel/head.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head.c). This function reserves a memory block for the `EBDA`, or Extended BIOS Data Area. The Extended BIOS Data Area is located at the top of conventional memory and contains data about ports, disk parameters, etc.
We exit from the `reserve_ebda_region` function early if paravirtualization is enabled, because in that case the extended BIOS data area is absent. In the next step we need to get the end of the low memory.
We get the virtual address of the BIOS low memory size, read the value stored there in kilobytes and convert it to bytes by shifting it left by 10 bits (multiplying by 1024, in other words). After this, we need to get the address of the extended BIOS data area with the `get_bios_ebda` function, which is defined in [arch/x86/include/asm/bios_ebda.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/bios_ebda.h).
Let's try to understand how it works. Here we can see that we convert the physical address `0x40E` to a virtual one, where `0x0040:0x000e` is the segment:offset address of the word that contains the segment of the extended BIOS data area. Don't worry that we are using the `phys_to_virt` function to convert a physical address to a virtual address. You may note that previously we used the `__va` macro for the same purpose, but `phys_to_virt` does the same thing.
The only difference is the type of its argument, `phys_addr_t`, whose width depends on the `CONFIG_PHYS_ADDR_T_64BIT` configuration option. After we get the virtual address of the word which stores the segment of the extended BIOS data area, we read it, shift it left by 4 and return the result. After this the `ebda_addr` variable contains the base address of the extended BIOS data area.
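Both reads from the BIOS data area, the low memory size and the EBDA segment, can be sketched against a simulated copy of low memory. The offset `0x40E` comes from the text; the offset `0x413` for the low memory size is an assumption recalled from the kernel sources, and every `sketch_` name is hypothetical.

```c
#include <stdint.h>

/* Offsets inside the BIOS data area. 0x40E is from the text above;
 * 0x413 (low memory size in kilobytes) is an assumption of this sketch. */
#define SKETCH_BIOS_EBDA_SEGMENT 0x40E
#define SKETCH_BIOS_LOWMEM_KB    0x413

/* Read a 16-bit little-endian word from a simulated copy of low memory. */
static uint16_t sketch_read_word(const uint8_t *mem, unsigned int offset)
{
    return (uint16_t)(mem[offset] | (mem[offset + 1] << 8));
}

/* Kilobytes to bytes: shift left by 10 (multiply by 1024). */
static unsigned int sketch_lowmem_bytes(const uint8_t *mem)
{
    return (unsigned int)sketch_read_word(mem, SKETCH_BIOS_LOWMEM_KB) << 10;
}

/* Real mode segment to linear address: shift left by 4 (multiply by 16). */
static unsigned int sketch_ebda_addr(const uint8_t *mem)
{
    return (unsigned int)sketch_read_word(mem, SKETCH_BIOS_EBDA_SEGMENT) << 4;
}
```

With a segment value of `0x9FC0` the sketch yields the address `0x9FC00`, and a low memory size of 639 kilobytes yields the same `0x9FC00` bytes.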
or 128 kilobytes. In the last step we take the lower of the low memory end and the extended BIOS data area address and call the `memblock_reserve` function, which reserves the memory region for the extended BIOS data between that address and the one megabyte mark.
The `memblock_reserve` function reserves a memory region for the given base address and size. It is the first function in this book from the Linux kernel memory manager framework. We will take a closer look at the memory manager soon, but for now let's look at its implementation.
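The choice of the reserved range can be sketched as a small pure function. The 128 kilobyte cutoff and the one megabyte mark come from the text; the absolute cap of `0x9f000` is an assumption recalled from the kernel sources, and the names below are hypothetical.

```c
#define SKETCH_INSANE_CUTOFF 0x20000U   /* 128 kilobytes, from the text */
#define SKETCH_LOWMEM_CAP    0x9f000U   /* assumed absolute upper limit */
#define SKETCH_ONE_MB        0x100000U

struct sketch_range {
    unsigned int base;
    unsigned int size;
};

static unsigned int sketch_min(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* Compute the region that would be handed to the reserve call. */
static struct sketch_range sketch_ebda_reservation(unsigned int lowmem,
                                                   unsigned int ebda_addr)
{
    struct sketch_range r;

    /* Values below 128K are treated as bogus and replaced by the cap. */
    if (ebda_addr < SKETCH_INSANE_CUTOFF)
        ebda_addr = SKETCH_LOWMEM_CAP;
    if (lowmem < SKETCH_INSANE_CUTOFF)
        lowmem = SKETCH_LOWMEM_CAP;

    /* Use the lower of the two markers, never above the cap. */
    lowmem = sketch_min(lowmem, ebda_addr);
    lowmem = sketch_min(lowmem, SKETCH_LOWMEM_CAP);

    /* Everything from there up to the 1MB mark is reserved. */
    r.base = lowmem;
    r.size = SKETCH_ONE_MB - lowmem;
    return r;
}
```

Under these assumptions, a low memory end and an EBDA address of `0x9fc00` give a reserved region that starts at `0x9f000` and is `0x61000` bytes long.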
In the previous paragraph we stopped at the call to the `memblock_reserve` function, and as I said before, it is the first function from the memory manager framework. Let's try to understand how it works. The `memblock_reserve` function just calls the `memblock_reserve_region` function.
and describes a generic memory block. You can see that we initialize `_rgn` by assigning it the address of `memblock.reserved`. `memblock` is a global variable.
We will not dive into the details of this variable now, but will return to them later in the parts concerning the memory manager. Just note that the `memblock` variable is defined with the `__initdata_memblock` attribute.
From its definition we can conclude that all memory blocks will be in the `.meminit.data` section. After we have defined `_rgn` we print information about it with the `memblock_dbg` macro. You can enable this output by passing `memblock=debug` on the kernel command line.
After the debugging lines are printed, the next step is a call to the `memblock_add_range` function, which adds a new memory block region into the `.meminit.data` section. As `_rgn` is not otherwise initialized and just contains `&memblock.reserved`, we simply fill the first region of the passed `_rgn` with the base address of the extended BIOS data area region, the size of this region and the flags.
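Filling the first, still empty region can be sketched as follows. The structures are simplified, hypothetical stand-ins for the kernel's memory block types, and the function handles only the empty-array case described above; merging and inserting further regions is deliberately left out.

```c
#include <stdint.h>

/* Hypothetical, simplified description of one memory region. */
struct sketch_region {
    uint64_t base;
    uint64_t size;
    unsigned long flags;
    int nid;                    /* NUMA node id */
};

/* Hypothetical, simplified collection of regions of one type. */
struct sketch_type {
    unsigned long cnt;          /* number of regions in use */
    unsigned long max;          /* capacity of the regions array */
    uint64_t total_size;        /* sum of all region sizes */
    struct sketch_region *regions;
};

/* Returns 0 if the region was stored in the empty first slot,
 * -1 if the array already holds a region (not handled here). */
static int sketch_add_first_region(struct sketch_type *type, uint64_t base,
                                   uint64_t size, int nid,
                                   unsigned long flags)
{
    if (type->regions[0].size != 0)
        return -1;

    type->regions[0].base = base;
    type->regions[0].size = size;
    type->regions[0].flags = flags;
    type->regions[0].nid = nid;
    type->total_size = size;
    return 0;
}
```

A caller would pass the base and size of the region to reserve together with a node id and flags; the concrete values are up to the caller and are not fixed by this sketch.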
The NUMA node id depends on the `MAX_NUMNODES` macro, which is defined in [include/linux/numa.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/numa.h).
After this we have the first reserved `memblock` for the extended BIOS data area in the `.meminit.data` section. The `reserve_ebda_region` function has finished its work at this step and we can go back to [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head64.c).
This is the end of the third part about Linux kernel insides. In the next part we will see the first initialization steps in the kernel entry point - the `start_kernel` function. It will be the first step before we see the launch of the first `init` process.
**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**