
Kernel booting process. Part 4.

Transition to 64-bit mode

This is the fourth part of the Kernel booting process. Here we will see the first steps in protected mode, like checking that the CPU supports long mode and SSE, paging and initialization of the page tables, and, at the end of this part, the transition to long mode.

NOTE: there will be a lot of assembly code in this part, so if you are not familiar with assembly, you may want to read a book about it first

In the previous part we stopped at the jump to the 32-bit entry point in the arch/x86/boot/pmjump.S:

jmpl	*%eax

Recall that the eax register contains the address of the 32-bit entry point. We can read about it in the linux kernel x86 boot protocol:

When using bzImage, the protected-mode kernel was relocated to 0x100000

Now we can verify that this is true. Let's look at the register values at the 32-bit entry point:

eax            0x100000	1048576
ecx            0x0	    0
edx            0x0	    0
ebx            0x0	    0
esp            0x1ff5c	0x1ff5c
ebp            0x0	    0x0
esi            0x14470	83056
edi            0x0	    0
eip            0x100000	0x100000
eflags         0x46	    [ PF ZF ]
cs             0x10	16
ss             0x18	24
ds             0x18	24
es             0x18	24
fs             0x18	24
gs             0x18	24

We can see here that the cs register contains 0x10 (which, as you may remember from the previous part, is the second index in the Global Descriptor Table), the eip register is 0x100000, and the base address of all segments, including the code segment, is zero. So the physical address is 0:0x100000 or just 0x100000, as specified by the boot protocol. Now let's start with the 32-bit entry point.

32-bit entry point

We can find the definition of the 32-bit entry point in the arch/x86/boot/compressed/head_64.S assembly source code file:

	__HEAD
	.code32
ENTRY(startup_32)
....
....
....
ENDPROC(startup_32)

First of all, why a compressed directory? The bzImage is a gzipped package consisting of vmlinux, a header and kernel setup code. We looked at the kernel setup code in all of the previous parts. So the main goal of head_64.S is to prepare for entering long mode, enter it, and then decompress the kernel. We will see all of these steps except kernel decompression in this part.

Also you may note that there are two files in the arch/x86/boot/compressed directory:

  • head_32.S
  • head_64.S

We will look only at head_64.S because, as you may remember, this book is only about x86_64; head_32.S is not even compiled in our case. Let's look at the arch/x86/boot/compressed/Makefile script. We can see there the following target:

vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
	$(obj)/string.o $(obj)/cmdline.o \
	$(obj)/piggy.o $(obj)/cpuflags.o

Take a look at $(obj)/head_$(BITS).o. This means that the selection of head_32.o or head_64.o for compilation depends on the value of $(BITS). It is defined in another Makefile - arch/x86/Makefile:

ifeq ($(CONFIG_X86_32),y)
	    BITS := 32
        ...
		...
else
		...
		...
        BITS := 64
endif

Now we know where to start, so let's do it.

Reload the segments if needed

As I wrote above, we start in the arch/x86/boot/compressed/head_64.S assembly source code file. First we see the definition of a special section attribute before the startup_32 definition:

    __HEAD
	.code32
ENTRY(startup_32)

__HEAD is a macro which is defined in the include/linux/init.h header file and expands to the definition of the following section:

#define __HEAD		.section	".head.text","ax"

with the name .head.text and the flags ax. In our case, these flags show that the section is allocatable and executable, or in other words contains code. We can find the definition of this section in the arch/x86/boot/compressed/vmlinux.lds.S linker script:

SECTIONS
{
	. = 0;
	.head.text : {
		_head = . ;
		HEAD_TEXT
		_ehead = . ;
	}

If you are not familiar with the syntax of the GNU LD linker scripting language, you can find more information in the documentation. In short, the . symbol is a special linker variable - the location counter. The value assigned to it is an offset relative to the offset of the segment. In our case, we assign zero to the location counter. This means that our code is linked to run from offset 0 in memory. Moreover, we can find this information in the comments:

Be careful parts of head_64.S assume startup_32 is at address 0.

Ok, now we know where we are, so it is the best time to look inside the startup_32 function.

At the beginning of the startup_32 function, we can see the cld instruction which clears the DF bit in the flags register. When the direction flag is clear, all string operations like stos, scas and others will increment the index registers esi or edi. We need to clear the direction flag because later we will use string operations to perform tasks like clearing space for the page tables.

After we have cleared the DF bit, the next step is to check the KEEP_SEGMENTS flag from the loadflags kernel setup header field. If you remember, we already saw loadflags in the very first part of this book. There we checked the CAN_USE_HEAP flag to get the ability to use the heap. Now we need to check the KEEP_SEGMENTS flag. This flag is described in the linux boot protocol documentation:

Bit 6 (write): KEEP_SEGMENTS
  Protocol: 2.07+
  - If 0, reload the segment registers in the 32bit entry point.
  - If 1, do not reload the segment registers in the 32bit entry point.
    Assume that %cs %ds %ss %es are all set to flat segments with
	a base of 0 (or the equivalent for their environment).

So, if the KEEP_SEGMENTS bit is not set in the loadflags, we need to reset the ds, ss and es segment registers to a flat segment with base 0. That is what we do:

	testb $(1 << 6), BP_loadflags(%esi)
	jnz 1f

	cli
	movl	$(__BOOT_DS), %eax
	movl	%eax, %ds
	movl	%eax, %es
	movl	%eax, %ss

Remember that __BOOT_DS is 0x18 (the index of the data segment in the Global Descriptor Table). If KEEP_SEGMENTS is set, we jump to the nearest 1f label; if it is not set, we update the segment registers with __BOOT_DS. This is pretty easy, but there is one interesting point here. If you've read the previous part, you may remember that we already updated these segment registers right after we moved to protected mode in arch/x86/boot/pmjump.S. So why do we need to care about the values of the segment registers again? The answer is easy too. The Linux kernel also has a 32-bit boot protocol, and if a bootloader uses it to load the Linux kernel, all the code before startup_32 will be skipped. In this case, startup_32 will be the first entry point of the Linux kernel right after the bootloader, and there is no guarantee that the segment registers will be in a known state.

After we have checked the KEEP_SEGMENTS flag and put the correct values in the segment registers, the next step is to calculate the difference between where the kernel is loaded and where it was compiled to run. Remember that vmlinux.lds.S contains the following definition: . = 0 at the start of the .head.text section. This means that the code in this section is compiled to run from address 0. We can see this in the objdump output:

arch/x86/boot/compressed/vmlinux:     file format elf64-x86-64


Disassembly of section .head.text:

0000000000000000 <startup_32>:
   0:   fc                      cld
   1:   f6 86 11 02 00 00 40    testb  $0x40,0x211(%rsi)

The objdump util tells us that the address of startup_32 is 0, but that is not actually where it runs. Our current goal is to find out where we actually are. This is pretty simple in long mode, because it supports rip-relative addressing, but right now we are in protected mode. We will use a common pattern to find the address of startup_32: define a label, make a call to it, and pop the top of the stack into a register:

call label
label: pop %reg

After this, the register will contain the address of the label. Let's look at the similar code which finds the address of startup_32 in the Linux kernel:

	leal	(BP_scratch+4)(%esi), %esp
	call	1f
1:  popl	%ebp
	subl	$1b, %ebp

As you may remember from the previous part, the esi register contains the address of the boot_params structure which was filled before we moved to protected mode. The boot_params structure contains a special field scratch with offset 0x1e4. This four byte field will be a temporary stack for the call instruction. We take the address of the scratch field plus 4 bytes and put it in the esp register. We add 4 bytes to the base of the BP_scratch field because, as just described, it will be a temporary stack and the stack grows downwards on x86_64, so our stack pointer will point to the correct top of the stack. After this we can see the pattern that I described above: we make a call to the 1f label and put the address of this label in the ebp register, because the return address is on the top of the stack after the call instruction executes. So, now we have the address of the 1f label, and from it it is easy to get the address of startup_32. We just need to subtract the link-time offset of the label from the address that we got from the stack:

startup_32 (0x0)     +-----------------------+
                     |                       |
                     |                       |
                     |                       |
                     |                       |
                     |                       |
                     |                       |
                     |                       |
                     |                       |
1f (0x0 + 1f offset) +-----------------------+ %ebp - real physical address
                     |                       |
                     |                       |
                     +-----------------------+

startup_32 is linked to run at address 0x0, which means 1f has the address 0x0 + offset to 1f, roughly 0x21 bytes. The ebp register contains the real physical address of the 1f label. So, if we subtract the link-time offset of 1f from ebp, we get the real physical address of startup_32. The Linux kernel boot protocol says that the base of the protected-mode kernel is 0x100000. We can verify this with gdb. Let's start the debugger and put a breakpoint at address 0x100022, right after the popl instruction. If we are right, we will see the address of the 1f label in the ebp register:

$ gdb
(gdb)$ target remote :1234
Remote debugging using :1234
0x0000fff0 in ?? ()
(gdb)$ br *0x100022
Breakpoint 1 at 0x100022
(gdb)$ c
Continuing.

Breakpoint 1, 0x00100022 in ?? ()
(gdb)$ i r
eax            0x18	0x18
ecx            0x0	0x0
edx            0x0	0x0
ebx            0x0	0x0
esp            0x144a8	0x144a8
ebp            0x100021	0x100021
esi            0x142c0	0x142c0
edi            0x0	0x0
eip            0x100022	0x100022
eflags         0x46	[ PF ZF ]
cs             0x10	0x10
ss             0x18	0x18
ds             0x18	0x18
es             0x18	0x18
fs             0x18	0x18
gs             0x18	0x18

If we execute the next instruction, subl $1b, %ebp, we will see:

nexti
...
ebp            0x100000	0x100000
...

Ok, that's right. The address of startup_32 is 0x100000. Now that we know the address of the startup_32 label, we can start preparing for the transition to long mode. Our next goal is to set up the stack and verify that the CPU supports long mode and SSE.

Stack setup and CPU verification

We could not set up the stack until we knew the address of the startup_32 label. We can imagine the stack as an array, and the stack pointer register esp must point to the end of this array. Of course, we can define an array in our code, but we need to know its actual address to configure the stack pointer correctly. Let's look at the code:

	movl	$boot_stack_end, %eax
	addl	%ebp, %eax
	movl	%eax, %esp

boot_stack_end is defined in the same arch/x86/boot/compressed/head_64.S assembly source code file and is located in the .bss section:

	.bss
	.balign 4
boot_heap:
	.fill BOOT_HEAP_SIZE, 1, 0
boot_stack:
	.fill BOOT_STACK_SIZE, 1, 0
boot_stack_end:

First of all, we put the address of boot_stack_end into the eax register, so the eax register contains the address of boot_stack_end where it was linked, or in other words 0x0 + boot_stack_end. To get the real address of boot_stack_end, we need to add the real address of startup_32. As you remember, we found this address above and put it into the ebp register. In the end, the eax register will contain the real address of boot_stack_end, and we just need to put it into the stack pointer.
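To make the arithmetic concrete, here is a minimal C sketch of this relocation idea. It is not kernel code: the load base and the link-time address of boot_stack_end below are made-up values, used purely for illustration:

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		/* Hypothetical values: the decompressor is linked to run at 0, */
		/* but was loaded at 0x100000 (the value we recovered in %ebp). */
		uint32_t load_base      = 0x100000; /* runtime %ebp             */
		uint32_t stack_end_link = 0x9020;   /* made-up link-time address
		                                       of boot_stack_end        */

		/* movl $boot_stack_end, %eax ; addl %ebp, %eax ; movl %eax, %esp */
		uint32_t esp = stack_end_link + load_base;
		printf("stack pointer: %#x\n", esp); /* 0x109020 */
		return 0;
	}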

After we have set up the stack, the next step is CPU verification. As we are going to transition to long mode, we need to check that the CPU supports long mode and SSE. We will do this with a call to the verify_cpu function:

	call	verify_cpu
	testl	%eax, %eax
	jnz	no_longmode

This function is defined in the arch/x86/kernel/verify_cpu.S assembly file and just contains a couple of calls to the cpuid instruction. This instruction is used to get information about the processor. In our case, it checks for long mode and SSE support and returns 0 on success or 1 on failure in the eax register.
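We will not go through verify_cpu line by line, but roughly the same checks can be sketched in userspace C with the compiler's cpuid.h helpers. Long mode is reported by CPUID leaf 0x80000001 (EDX bit 29) and SSE by leaf 1 (EDX bit 25); this only illustrates the idea and is not the kernel's code:

	#include <cpuid.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* CPUID.80000001h:EDX bit 29 - long mode support */
		if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
			return 1; /* extended leaf not supported at all */
		int long_mode = (edx >> 29) & 1;

		/* CPUID.01h:EDX bit 25 - SSE support */
		__get_cpuid(1, &eax, &ebx, &ecx, &edx);
		int sse = (edx >> 25) & 1;

		printf("long mode: %d, sse: %d\n", long_mode, sse);
		return !(long_mode && sse); /* 0 on success, like verify_cpu */
	}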

If the value of eax is not zero, we jump to the no_longmode label, which just stops the CPU with the hlt instruction in an infinite loop (each hardware interrupt that wakes the CPU is followed by another hlt):

no_longmode:
1:
	hlt
	jmp     1b

If the value of the eax register is zero, everything is ok and we can continue.

Calculate relocation address

The next step is to calculate the relocation address for decompression, if needed. First of all, we need to know what a relocatable kernel means. We already know that the base address of the 32-bit entry point of the Linux kernel is 0x100000, but that is only the 32-bit entry point. The default base address of the Linux kernel is determined by the value of the CONFIG_PHYSICAL_START kernel configuration option, and its default value is 0x1000000, or 16 MB. The main problem here is that if the Linux kernel crashes, a kernel developer must have a rescue kernel for kdump which is configured to load from a different address. The Linux kernel provides a special configuration option to solve this problem - CONFIG_RELOCATABLE. As we can read in the documentation of the Linux kernel:

This builds a kernel image that retains relocation information
so it can be loaded someplace besides the default 1MB.

Note: If CONFIG_RELOCATABLE=y, then the kernel runs from the address
it has been loaded at and the compile time physical address
(CONFIG_PHYSICAL_START) is used as the minimum location.

In simple words, this means that the Linux kernel with the same configuration can be booted from different addresses. Technically, this is done by compiling the decompressor as position independent code. If we look at arch/x86/boot/compressed/Makefile, we will see that the decompressor is indeed compiled with the -fPIC flag:

KBUILD_CFLAGS += -fno-strict-aliasing -fPIC

When we use position-independent code, an address is obtained by adding the address field of the instruction to the value of the program counter. We can load code which uses such addressing from any address. That's why we had to get the real physical address of startup_32. Now let's get back to the Linux kernel code. Our current goal is to calculate the address where the kernel will be relocated for decompression. The calculation of this address depends on the CONFIG_RELOCATABLE kernel configuration option. Let's look at the code:

#ifdef CONFIG_RELOCATABLE
	movl	%ebp, %ebx
	movl	BP_kernel_alignment(%esi), %eax
	decl	%eax
	addl	%eax, %ebx
	notl	%eax
	andl	%eax, %ebx
	cmpl	$LOAD_PHYSICAL_ADDR, %ebx
	jge	1f
#endif
	movl	$LOAD_PHYSICAL_ADDR, %ebx
1:
	addl	$z_extract_offset, %ebx

Remember that the value of the ebp register is the physical address of the startup_32 label. If the CONFIG_RELOCATABLE kernel configuration option is enabled, we put this address in the ebx register, align it up to a multiple of 2MB and compare it with the LOAD_PHYSICAL_ADDR value. The LOAD_PHYSICAL_ADDR macro is defined in the arch/x86/include/asm/boot.h header file and looks like this:

#define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \
				+ (CONFIG_PHYSICAL_ALIGN - 1)) \
				& ~(CONFIG_PHYSICAL_ALIGN - 1))

As we can see, it just expands to the CONFIG_PHYSICAL_START value aligned up to CONFIG_PHYSICAL_ALIGN, which represents the physical address where the kernel is loaded. After comparing LOAD_PHYSICAL_ADDR with the value of the ebx register, we add the offset from startup_32 where the compressed kernel image will be relocated for decompression. If the CONFIG_RELOCATABLE option is not enabled during kernel configuration, we just use the default load address and add z_extract_offset to it.

After all of these calculations, ebp contains the address where we were loaded and ebx contains the address where the kernel will be moved for decompression.
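The whole calculation can be sketched in C. The values below are illustrative: a typical 2 MB alignment and a made-up z_extract_offset (the real one is generated at build time):

	#include <stdint.h>
	#include <stdio.h>

	#define KERNEL_ALIGNMENT   0x200000u  /* typical 2 MB alignment           */
	#define LOAD_PHYSICAL_ADDR 0x1000000u /* default CONFIG_PHYSICAL_START    */
	#define Z_EXTRACT_OFFSET   0x3c5000u  /* made up; generated at build time */

	int main(void)
	{
		uint32_t ebp = 0x100000; /* where we were loaded */
		uint32_t ebx = (ebp + KERNEL_ALIGNMENT - 1)
		               & ~(KERNEL_ALIGNMENT - 1); /* align up (CONFIG_RELOCATABLE) */
		if (ebx < LOAD_PHYSICAL_ADDR) /* never decompress below the minimum */
			ebx = LOAD_PHYSICAL_ADDR;
		ebx += Z_EXTRACT_OFFSET; /* leave room for in-place decompression */
		printf("relocation address: %#x\n", ebx); /* 0x13c5000 */
		return 0;
	}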

Preparation before entering long mode

Now that we have the base address where the compressed kernel image will be relocated for safe decompression, we need to do the last preparations before we can see the transition to 64-bit mode. First, we need to update the Global Descriptor Table:

	leal	gdt(%ebp), %eax
	movl	%eax, gdt+2(%ebp)
	lgdt	gdt(%ebp)

Here we compute the real address of gdt by adding its offset to the base address in the ebp register and put it in the eax register. Next, we store this address at offset gdt+2 from ebp and load the Global Descriptor Table with the lgdt instruction. To understand the magic with the gdt offsets, we need to look at the definition of the Global Descriptor Table. We can find it in the same source code file:

	.data
gdt:
	.word	gdt_end - gdt
	.long	gdt
	.word	0
	.quad	0x0000000000000000	/* NULL descriptor */
	.quad	0x00af9a000000ffff	/* __KERNEL_CS */
	.quad	0x00cf92000000ffff	/* __KERNEL_DS */
	.quad	0x0080890000000000	/* TS descriptor */
	.quad   0x0000000000000000	/* TS continued */
gdt_end:

We can see that it is located in the .data section and contains five descriptors: a null descriptor, the kernel code segment, the kernel data segment, and two task descriptors. We already loaded a Global Descriptor Table in the previous part, and now we're doing almost the same here, but the code segment descriptor now has CS.L = 1 and CS.D = 0 for execution in 64-bit mode. As we can see, the definition of gdt starts with two bytes, gdt_end - gdt, which represent the limit of the gdt table. The next four bytes contain the base address of the gdt. Remember that the pointer to the Global Descriptor Table is stored in the 48-bit GDTR register, which consists of two parts:

  • size(16-bit) of global descriptor table;
  • address(32-bit) of the global descriptor table.

So, we put the address of the gdt in the eax register and then store it at .long gdt, i.e. gdt+2, in our assembly code. From now on, we have a correctly formed structure for the GDTR register and can load the Global Descriptor Table with the lgdt instruction.
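In C terms, the 6-byte operand of lgdt can be sketched as a packed structure; this mirrors the .word/.long pair at the start of gdt above (the struct and field names are mine, not the kernel's):

	#include <stdint.h>
	#include <stdio.h>

	struct gdt_ptr {
		uint16_t limit; /* .word gdt_end - gdt                             */
		uint32_t base;  /* .long gdt - patched at run time via gdt+2(%ebp) */
	} __attribute__((packed));

	int main(void)
	{
		/* 2 bytes of limit + 4 bytes of base = the 48-bit GDTR operand */
		printf("sizeof(struct gdt_ptr) = %zu\n", sizeof(struct gdt_ptr)); /* 6 */
		return 0;
	}

The movl %eax, gdt+2(%ebp) instruction is then effectively base = load_base + link-time address of gdt, rewriting the link-time base with the real physical one.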

After we have loaded the Global Descriptor Table, we must enable PAE mode by putting the value of the cr4 register into eax, setting bit 5 (the PAE bit) in it, and loading it back into cr4:

	movl	%cr4, %eax
	orl	$X86_CR4_PAE, %eax
	movl	%eax, %cr4
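The X86_CR4_PAE macro is just a mask for bit 5 of cr4 (0x20). As a hedged, ring-0-only C sketch with inline assembly, the same operation looks like this:

	#include <stdint.h>

	static inline void enable_pae(void)
	{
		uint32_t cr4;

		__asm__ volatile("mov %%cr4, %0" : "=r"(cr4));
		cr4 |= 1u << 5; /* X86_CR4_PAE */
		__asm__ volatile("mov %0, %%cr4" : : "r"(cr4));
	}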

Now we are almost finished with all preparations before we can move into 64-bit mode. The last step is to build page tables, but before that, here is some information about long mode.

Long mode

Long mode is the native mode for x86_64 processors. First of all let's look at some differences between x86_64 and x86.

It provides features such as:

  • 8 new general purpose registers from r8 to r15, and all general purpose registers are 64-bit now
  • A 64-bit instruction pointer - RIP
  • A new operating mode - Long mode
  • 64-bit addresses and operands
  • RIP-relative addressing (we will see an example of it in the next parts)

Long mode is an extension of legacy protected mode. It consists of two sub-modes:

  • 64-bit mode
  • compatibility mode

To switch into 64-bit mode, we need to do the following things:

  • enable PAE (we already did it, see above)
  • build page tables and load the address of the top level page table into the cr3 register
  • enable EFER.LME
  • enable paging

We already enabled PAE by setting the PAE bit in the cr4 register. Now let's look at paging.

Early page tables initialization

Before we can move into 64-bit mode, we need to build page tables, so let's look at the building of the early 4G boot page tables.

NOTE: I will not describe the theory of virtual memory here. If you need to know more about it, see the links at the end of this part

The Linux kernel uses 4-level paging, and generally we build 6 page tables:

  • One PML4 table
  • One PDP table
  • Four Page Directory tables

Let's look at the implementation. First of all, we clear the buffer for the page tables in memory. Every table is 4096 bytes, so we need a 24 kilobyte buffer:

	leal	pgtable(%ebx), %edi
	xorl	%eax, %eax
	movl	$((4096*6)/4), %ecx
	rep	stosl

We put the address stored in ebx (remember that ebx contains the address where the kernel will be relocated for decompression), plus the offset of pgtable, into the edi register. pgtable is defined at the end of head_64.S and looks like this:

	.section ".pgtable","a",@nobits
	.balign 4096
pgtable:
	.fill 6*4096, 1, 0

It is in the .pgtable section and its size is 24 kilobytes. After we put the address in edi, we zero out the eax register and write zeros to the buffer with the rep stosl instruction.

Now we can build the top level page table - PML4 - with:

	leal	pgtable + 0(%ebx), %edi
	leal	0x1007 (%edi), %eax
	movl	%eax, 0(%edi)

Here we take the address stored in ebx, add the offset of pgtable, and put it in edi. Next, we put this address plus 0x1007 into the eax register. 0x1007 is 4096 bytes (the size of the PML4) plus 7 (the PML4 entry flags - PRESENT+RW+USER). Finally, we write eax to the first PML4 entry. After this, the first PML4 entry contains the address of the Page Directory Pointer table with the flags PRESENT+RW+USER.

In the next step we build 4 entries in the Page Directory Pointer table, which point to the four Page Directories, with the same 0x7 flags, i.e. present, write, userspace (PRESENT | RW | USER):

	leal	pgtable + 0x1000(%ebx), %edi
	leal	0x1007(%edi), %eax
	movl	$4, %ecx
1:  movl	%eax, 0x00(%edi)
	addl	$0x00001000, %eax
	addl	$8, %edi
	decl	%ecx
	jnz	1b

We put the base address of the Page Directory Pointer table in edi and the address of the first Page Directory (plus the 0x7 flags) in eax. We then put 4 in the ecx register, which will be the counter for the following loop, and write the value of eax to the first Page Directory Pointer entry.

After this, edi contains the address of the first Page Directory Pointer entry with flags 0x7. Next we just advance eax by 0x1000 to point to each following Page Directory and edi by 8 bytes to point to each following entry, writing each Page Directory address into its entry.

The next step is to build the 2048 Page Directory entries, each mapping a 2-MByte page:

	leal	pgtable + 0x2000(%ebx), %edi
	movl	$0x00000183, %eax
	movl	$2048, %ecx
1:  movl	%eax, 0(%edi)
	addl	$0x00200000, %eax
	addl	$8, %edi
	decl	%ecx
	jnz	1b

Here we do almost the same as in the previous example: all entries get the flags $0x00000183 - PRESENT + RW + PS (2-MByte page) + GLOBAL. In the end we will have 2048 entries, each mapping a 2-MByte page.
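Putting the three loops together, here is a C sketch of the whole early page table setup, under the assumption of a 24 KiB buffer laid out exactly like pgtable: one PML4, one Page Directory Pointer table and four Page Directories, each 4096 bytes. The function name is mine:

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	static uint64_t pgtable[6 * 4096 / 8] __attribute__((aligned(4096)));

	void build_early_page_tables(void)
	{
		uint64_t base  = (uint64_t)(uintptr_t)pgtable;
		uint64_t *pml4 = pgtable;        /* pgtable + 0x0000 */
		uint64_t *pdpt = pgtable + 512;  /* pgtable + 0x1000 */
		uint64_t *pd   = pgtable + 1024; /* pgtable + 0x2000 */

		memset(pgtable, 0, sizeof(pgtable)); /* the rep stosl above */

		/* one PML4 entry -> PDP table, flags 0x7 = PRESENT | RW | USER */
		pml4[0] = (base + 0x1000) | 0x7;

		/* four PDP entries -> four Page Directories */
		for (int i = 0; i < 4; i++)
			pdpt[i] = (base + 0x2000 + i * 0x1000) | 0x7;

		/* 2048 entries x 2 MB = 4 GB identity mapping,
		 * flags 0x183 = PRESENT | RW | PS | GLOBAL */
		for (int i = 0; i < 2048; i++)
			pd[i] = ((uint64_t)i * 0x200000) | 0x183;
	}

	int main(void)
	{
		build_early_page_tables();
		/* second 2 MB entry: 0x200000 | 0x183 = 0x200183 */
		printf("pd[1] = %#llx\n", (unsigned long long)pgtable[1024 + 1]);
		return 0;
	}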

Our early page table structure is done; it maps 4 gigabytes of memory. Now we can put the address of the top-level page table - PML4 - into the cr3 control register:

	leal	pgtable(%ebx), %eax
	movl	%eax, %cr3

That's all. Now we can see the transition to long mode.

Transition to long mode

First of all, we need to set the EFER.LME flag in the MSR at address 0xC0000080:

	movl	$MSR_EFER, %ecx
	rdmsr
	btsl	$_EFER_LME, %eax
	wrmsr

Here we put the MSR_EFER value (which is defined in arch/x86/include/uapi/asm/msr-index.h) in the ecx register and execute the rdmsr instruction, which reads the MSR specified by ecx. After rdmsr executes, the resulting data is in edx:eax. We set the EFER_LME bit with the btsl instruction and write the data from eax back to the MSR with the wrmsr instruction.
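The edx:eax convention can be illustrated with a C sketch using inline assembly. EFER is MSR 0xC0000080 and LME is bit 8; this runs only in ring 0 and is shown purely to make the register usage explicit:

	#include <stdint.h>

	#define MSR_EFER  0xC0000080u /* extended feature enable register */
	#define _EFER_LME 8           /* long mode enable bit             */

	static inline void enable_efer_lme(void)
	{
		uint32_t lo, hi;

		/* rdmsr: ecx selects the MSR, result arrives in edx:eax */
		__asm__ volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(MSR_EFER));
		lo |= 1u << _EFER_LME; /* btsl $_EFER_LME, %eax */
		/* wrmsr: write edx:eax back to the MSR selected by ecx */
		__asm__ volatile("wrmsr" : : "a"(lo), "d"(hi), "c"(MSR_EFER));
	}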

In the next step, we push the kernel code segment selector onto the stack (we defined the segment in the GDT) and put the address of the startup_64 routine in eax.

	pushl	$__KERNEL_CS
	leal	startup_64(%ebp), %eax

After this we push this address onto the stack and enable paging by setting the PG and PE bits in the cr0 register:

	movl	$(X86_CR0_PG | X86_CR0_PE), %eax
	movl	%eax, %cr0

and execute:

lret

Remember that we pushed the kernel code segment selector and the address of the startup_64 function onto the stack in the previous steps; the lret instruction pops both of them and performs a far jump to startup_64 in the new 64-bit code segment.

After all of these steps we're finally in 64-bit mode:

	.code64
	.org 0x200
ENTRY(startup_64)
....
....
....

That's all!

Conclusion

This is the end of the fourth part of the Linux kernel booting process. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email, or just create an issue.

In the next part, we will see kernel decompression and much more.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.