mirror of https://github.com/0xAX/linux-insides.git synced 2024-11-01 04:29:39 +00:00

0xAX aedf30cbf4 Update linux-bootstrap-1.md

2015-07-28 23:39:57 +06:00

Kernel booting process. Part 1.

From the bootloader to kernel

If you have read my previous blog posts, you can see that some time ago I started to get involved with low-level programming. I wrote some posts about x86_64 assembly programming for Linux. At the same time, I started to dive into the Linux source code. I have a great interest in understanding how low-level things work: how programs run on my computer, how they are located in memory, how the kernel manages processes and memory, how the network stack works at a low level and many, many other things. So, I decided to write yet another series of posts about the Linux kernel for x86_64.

Note that I'm not a professional kernel hacker and I don't write code for the kernel at work. It's just a hobby. I just like low-level stuff, and it is interesting for me to see how these things work. So if you notice anything confusing, or if you have any questions/remarks, ping me on twitter 0xAX, drop me an email or just create an issue. I appreciate it. All posts will also be accessible at linux-insides and if you find something wrong with my English or the post content, feel free to send a pull request.

Note that this isn't the official documentation, just learning and sharing knowledge.

Required knowledge

Understanding C code
Understanding assembly code (AT&T syntax)

Anyway, if you have just started to learn these tools, I will try to explain some parts during this and the following posts. Ok, the little introduction is finished and now we can start to dive into the kernel and low-level stuff.

All code is for kernel version 3.18. If there are changes, I will update the posts accordingly.

The Magic Power Button, What happens next?

Although this is a series of posts about the Linux kernel, we will not start from kernel code (at least in this paragraph). Ok, you pressed the magic power button on your laptop or desktop computer and it started to work. After the motherboard sends a signal to the power supply, the power supply provides the computer with the proper amount of electricity. Once the motherboard receives the power good signal, it tries to start the CPU. The CPU resets all leftover data in its registers and sets up predefined values for each of them.

80386 and later CPUs define the following predefined data in CPU registers after the computer resets:

IP          0xfff0
CS selector 0xf000
CS base     0xffff0000

The processor starts working in real mode, and we need to back up a little to understand memory segmentation in this mode. Real mode is supported in all x86-compatible processors, from the 8086 to modern Intel 64-bit CPUs. The 8086 processor had a 20-bit address bus, which means that it could work with an address space of 2^20 bytes (1 megabyte), from 0 to 0xfffff. But it only has 16-bit registers, and with 16-bit registers the maximum address is 2^16 - 1 or 0xffff (64 kilobytes). Memory segmentation is used to make use of all of the address space available. All memory is divided into small, fixed-size segments of 65536 bytes, or 64 KB. Since we cannot address memory above 64 KB with 16-bit registers, an alternate method was devised. An address consists of two parts: the beginning address of the segment and the offset from the beginning of this segment. To get a physical address in memory, we need to multiply the segment part by 16 and add the offset part:

PhysicalAddress = Segment * 16 + Offset

For example if CS:IP is 0x2000:0x0010, the corresponding physical address will be:

>>> hex((0x2000 << 4) + 0x0010)
'0x20010'

But if we take the biggest segment part and offset: 0xffff:0xffff, it will be:

>>> hex((0xffff << 4) + 0xffff)
'0x10ffef'

which is 65519 bytes past the first megabyte. Since only one megabyte is accessible in real mode, 0x10ffef becomes 0x00ffef when the A20 line is disabled.
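The translation and the wrap-around can be put together in a small helper. This is only a sketch: the function name `real_mode_address` and the `a20_enabled` parameter are my own, and the A20 gate is modeled simply as masking the result to 20 bits.

```python
def real_mode_address(segment: int, offset: int, a20_enabled: bool = False) -> int:
    """Translate a real-mode segment:offset pair into a physical address."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must fit in 16 bits")
    linear = (segment << 4) + offset
    if not a20_enabled:
        # With the A20 line disabled, address bit 20 is forced to zero,
        # so addresses past 1 MB wrap around to the bottom of memory.
        linear &= 0xFFFFF
    return linear


print(hex(real_mode_address(0x2000, 0x0010)))                    # 0x20010
print(hex(real_mode_address(0xFFFF, 0xFFFF)))                    # 0xffef
print(hex(real_mode_address(0xFFFF, 0xFFFF, a20_enabled=True)))  # 0x10ffef
```

The last two calls show the same segment:offset pair with and without the wrap-around.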

Ok, now we know about real mode and memory addressing. Let's get back to register values after reset.

The CS register consists of two parts: the visible segment selector and the hidden base address. We know the predefined CS base and IP value, so the logical address will be:

0xffff0000:0xfff0

The starting address is formed by adding the base address to the value in the EIP register:

>>> hex(0xffff0000 + 0xfff0)
'0xfffffff0'

We get 0xfffffff0 which is 4GB - 16 bytes. This point is the Reset vector. This is the memory location at which CPU expects to find the first instruction to execute after reset. It contains a jump instruction which usually points to the BIOS entry point. For example, if we look in coreboot source code, we will see it:

	.section ".reset"
	.code16
.globl	reset_vector
reset_vector:
	.byte  0xe9
	.int   _start - ( . + 2 )
	...

We can see here the jump instruction opcode - 0xe9 to the address _start - ( . + 2). And we can see that reset section is 16 bytes and starts at 0xfffffff0:

SECTIONS {
	_ROMTOP = 0xfffffff0;
	. = _ROMTOP;
	.reset . : {
		*(.reset)
		. = 15 ;
		BYTE(0x00);
	}
}

Now the BIOS has started to work. After initializing and checking the hardware, it needs to find a bootable device. A boot order is stored in the BIOS configuration, and it controls which devices the BIOS attempts to boot from. When attempting to boot from a hard drive, the BIOS tries to find a boot sector. On hard drives partitioned with an MBR partition layout, the boot sector is the first sector (512 bytes) of the drive, with the boot code occupying its first 446 bytes. The final two bytes of the first sector are 0x55 and 0xaa, which signal to the BIOS that the device is bootable. For example:

;
; Note: this example is written in Intel Assembly syntax
;
[BITS 16]
[ORG  0x7c00]

boot:
    mov al, '!'
    mov ah, 0x0e
    mov bh, 0x00
    mov bl, 0x07

    int 0x10
    jmp $

times 510-($-$$) db 0

db 0x55
db 0xaa

Build and run it with:

nasm -f bin boot.nasm && qemu-system-x86_64 boot

This will instruct QEMU to use the boot binary we just built as a disk image. Since the binary generated by the assembly code above fulfills the requirements of the boot sector (the origin is set to 0x7c00, and we end with the magic sequence), QEMU will treat the binary as the master boot record (MBR) of a disk image.

We will see the ! symbol printed in the QEMU window.

In this example we can see that this code will be executed in 16-bit real mode and will start at 0x7c00 in memory. After the start it calls the 0x10 interrupt, which just prints the ! symbol. It fills the rest of the first 510 bytes with zeros and finishes with the two magic bytes 0x55 and 0xaa.
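The layout of such a sector can also be sketched without an assembler. The byte values below are my own hand-assembled encodings of the instructions in the example (an assumption, not output copied from nasm), and the helper names `build_boot_sector` and `looks_bootable` are hypothetical.

```python
# Hand-assembled machine code for the example above (assumed encodings).
CODE = bytes([
    0xB0, 0x21,  # mov al, '!'
    0xB4, 0x0E,  # mov ah, 0x0e
    0xB7, 0x00,  # mov bh, 0x00
    0xB3, 0x07,  # mov bl, 0x07
    0xCD, 0x10,  # int 0x10
    0xEB, 0xFE,  # jmp $  (short jump back to itself)
])

SECTOR_SIZE = 512
SIGNATURE = b"\x55\xaa"


def build_boot_sector(code: bytes) -> bytes:
    """Pad the code with zeros up to byte 510 and append the signature."""
    if len(code) > SECTOR_SIZE - len(SIGNATURE):
        raise ValueError("code does not fit in one sector")
    padding = bytes(SECTOR_SIZE - len(SIGNATURE) - len(code))
    return code + padding + SIGNATURE


def looks_bootable(sector: bytes) -> bool:
    """Check the two properties described in the text: size and magic bytes."""
    return len(sector) == SECTOR_SIZE and sector[510:512] == SIGNATURE


sector = build_boot_sector(CODE)
print(len(sector), looks_bootable(sector))  # 512 True
```

Writing `sector` to a file should give an image equivalent to the nasm build, provided the assumed encodings match what the assembler emits.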

You can see a binary dump of it with the objdump utility:

nasm -f bin boot.nasm
objdump -D -b binary -mi386 -Maddr16,data16,intel boot

A real-world boot sector has code for continuing the boot process and the partition table instead of a bunch of 0's and an exclamation point :) Ok so, from this point onwards BIOS hands over the control to the bootloader and we can go ahead.

NOTE: As you can read above, the CPU is in real mode. In real mode, the physical address in memory is calculated as follows:

PhysicalAddress = Segment * 16 + Offset

This is the same as I mentioned before. But we have only 16-bit general purpose registers, and the maximum value of a 16-bit register is 0xffff. So if we take the biggest values, the result will be:

>>> hex((0xffff * 16) + 0xffff)
'0x10ffef'

Here 0x10ffef is the highest such address, so 1MB + 64KB - 16 bytes can be formed this way. But the 8086 processor, which was the first processor with real mode, had a 20-bit address bus, and 2^20 = 1048576 bytes is 1MB. So the actual memory available is 1MB.

General real mode's memory map is:

0x00000000 - 0x000003FF - Real Mode Interrupt Vector Table
0x00000400 - 0x000004FF - BIOS Data Area
0x00000500 - 0x00007BFF - Unused
0x00007C00 - 0x00007DFF - Our Bootloader
0x00007E00 - 0x0009FFFF - Unused
0x000A0000 - 0x000BFFFF - Video RAM (VRAM) Memory
0x000B0000 - 0x000B7FFF - Monochrome Video Memory
0x000B8000 - 0x000BFFFF - Color Video Memory
0x000C0000 - 0x000C7FFF - Video ROM BIOS
0x000C8000 - 0x000EFFFF - BIOS Shadow Area
0x000F0000 - 0x000FFFFF - System BIOS
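As a quick way to play with this map, here is a small lookup sketch. The table copies the top-level ranges from the list above (the monochrome and color sub-ranges of video RAM are left out for brevity), and the function name `region_of` is my own.

```python
# (start, end, description) with inclusive bounds, taken from the map above.
REAL_MODE_MAP = [
    (0x00000000, 0x000003FF, "Real Mode Interrupt Vector Table"),
    (0x00000400, 0x000004FF, "BIOS Data Area"),
    (0x00000500, 0x00007BFF, "Unused"),
    (0x00007C00, 0x00007DFF, "Our Bootloader"),
    (0x00007E00, 0x0009FFFF, "Unused"),
    (0x000A0000, 0x000BFFFF, "Video RAM (VRAM) Memory"),
    (0x000C0000, 0x000C7FFF, "Video ROM BIOS"),
    (0x000C8000, 0x000EFFFF, "BIOS Shadow Area"),
    (0x000F0000, 0x000FFFFF, "System BIOS"),
]


def region_of(address: int):
    """Return the description of the region containing `address`, or None."""
    for start, end, description in REAL_MODE_MAP:
        if start <= address <= end:
            return description
    return None


# Each region starts right after the previous one ends, so the table
# covers the first megabyte without gaps.
for (_, prev_end, _), (next_start, _, _) in zip(REAL_MODE_MAP, REAL_MODE_MAP[1:]):
    assert next_start == prev_end + 1

print(region_of(0x7C00))      # Our Bootloader
print(region_of(0xB8000))     # Video RAM (VRAM) Memory
print(region_of(0xFFFFFFF0))  # None
```

The last lookup returns None because the reset vector lies far outside the first megabyte.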

But stop, at the beginning of the post I wrote that the first instruction executed by the CPU is located at address 0xFFFFFFF0, which is much bigger than 0xFFFFF (1MB). How can the CPU access it in real mode? The answer is in the coreboot documentation:

0xFFFE_0000 - 0xFFFF_FFFF: 128 kilobyte ROM mapped into address space

At the start of execution BIOS is not in RAM, it is located in the ROM.

Bootloader

There are a number of bootloaders which can boot Linux, such as GRUB 2 and syslinux. The Linux kernel has a Boot protocol which specifies the requirements for bootloaders to implement Linux support. This example will describe GRUB 2.

Now that the BIOS has chosen a boot device and transferred control to the boot sector code, execution starts from boot.img. This code is very simple due to the limited amount of space available, and contains a pointer that it uses to jump to the location of GRUB 2's core image. The core image begins with diskboot.img, which is usually stored immediately after the first sector in the unused space before the first partition. The above code loads the rest of the core image into memory, which contains GRUB 2's kernel and drivers for handling filesystems. After loading the rest of the core image, it executes grub_main.

grub_main initializes console, gets base address for modules, sets root device, loads/parses grub configuration file, loads modules etc. At the end of execution, grub_main moves grub to normal mode. grub_normal_execute (from grub-core/normal/main.c) completes last preparation and shows a menu for selecting an operating system. When we select one of grub menu entries, grub_menu_execute_entry begins to be executed, which executes grub boot command. It starts to boot the selected operating system.

As we can read in the kernel boot protocol, the bootloader must read and fill some fields of kernel setup header which starts at 0x01f1 offset from the kernel setup code. Kernel header arch/x86/boot/header.S starts from:

	.globl hdr
hdr:
	setup_sects: .byte 0
	root_flags:  .word ROOT_RDONLY
	syssize:     .long 0
	ram_size:    .word 0
	vid_mode:    .word SVGA_MODE
	root_dev:    .word 0
	boot_flag:   .word 0xAA55

The bootloader must fill this and the rest of the headers (only those marked as write in the Linux boot protocol) with values which it either got from the command line or calculated. We will not describe and explain all fields of the kernel setup header here; we will get back to them when the kernel uses them. Anyway, you can find a description of every field in the boot protocol.
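To make the layout of these first fields concrete, here is a sketch that reads them with Python's struct module. The field sizes follow the `.byte`/`.word`/`.long` directives shown above; the function name `parse_setup_header` and the sample values are made up for illustration and do not come from a real kernel image.

```python
import struct

HEADER_OFFSET = 0x1F1
# setup_sects (u8), root_flags (u16), syssize (u32), ram_size (u16),
# vid_mode (u16), root_dev (u16), boot_flag (u16); little-endian, unpadded.
HEADER_FORMAT = "<BHIHHHH"
FIELD_NAMES = ("setup_sects", "root_flags", "syssize", "ram_size",
               "vid_mode", "root_dev", "boot_flag")


def parse_setup_header(image: bytes) -> dict:
    """Decode the first setup header fields from the start of a kernel image."""
    values = struct.unpack_from(HEADER_FORMAT, image, HEADER_OFFSET)
    return dict(zip(FIELD_NAMES, values))


# A synthetic 512-byte sector with invented field values.
sector = bytearray(512)
struct.pack_into(HEADER_FORMAT, sector, HEADER_OFFSET,
                 4, 1, 0x1000, 0, 0xFFFF, 0, 0xAA55)

header = parse_setup_header(bytes(sector))
print(header["setup_sects"], hex(header["boot_flag"]))  # 4 0xaa55
print(sector[510:512].hex())                            # 55aa
```

These seven fields occupy 15 bytes, so they end exactly at offset 0x200, and boot_flag lands on the same 0x55 0xaa bytes that mark a bootable sector.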

As we can see in kernel boot protocol, the memory map will be the following after kernel loading:

         | Protected-mode kernel  |
100000   +------------------------+
         | I/O memory hole        |
0A0000   +------------------------+
         | Reserved for BIOS      | Leave as much as possible unused
         ~                        ~
         | Command line           | (Can also be below the X+10000 mark)
X+10000  +------------------------+
         | Stack/heap             | For use by the kernel real-mode code.
X+08000  +------------------------+
         | Kernel setup           | The kernel real-mode code.
         | Kernel boot sector     | The kernel legacy boot sector.
       X +------------------------+
         | Boot loader            | <- Boot sector entry point 0x7C00
001000   +------------------------+
         | Reserved for MBR/BIOS  |
000800   +------------------------+
         | Typically used by MBR  |
000600   +------------------------+
         | BIOS use only          |
000000   +------------------------+

So when the bootloader transfers control to the kernel, execution starts at:

X + sizeof(KernelBootSector)

where X is the address at which the kernel boot sector is loaded and the boot sector is 512 (0x200) bytes long. In my case X is 0x10000, so execution starts at 0x10200.

Ok, now the bootloader has loaded Linux kernel into the memory, filled header fields and jumped to it. Now we can move directly to the kernel setup code.

Start of Kernel Setup

Finally we are in the kernel. Technically the kernel hasn't run yet: first of all we need to set up the kernel, memory manager, process manager etc. Kernel setup execution starts from arch/x86/boot/header.S at _start. It looks a little strange at first glance, since there are many instructions before it.

A long time ago the Linux kernel had its own bootloader, but now if you run for example:

qemu-system-x86_64 vmlinuz-3.18-generic

You will only see an error message rather than a booting kernel.

Actually header.S starts with MZ, followed by the code that prints this error message and then the PE header:

#ifdef CONFIG_EFI_STUB
# "MZ", MS-DOS header
.byte 0x4d
.byte 0x5a
#endif
...
...
...
pe_header:
	.ascii "PE"
	.word 0

It needs this for loading the operating system with UEFI. We will not look at how it works here (we will see this in later parts).

So the actual kernel setup entry point is:

// header.S line 292
.globl _start
_start:

Bootloader (grub2 and others) knows about this point (0x200 offset from MZ) and makes a jump directly to this point, despite the fact that header.S starts from .bstext section which prints error message:

//
// arch/x86/boot/setup.ld
//
. = 0;                    // current position
.bstext : { *(.bstext) }  // put .bstext section to position 0
.bsdata : { *(.bsdata) }

So kernel setup entry point is:

	.globl _start
_start:
	.byte 0xeb
	.byte start_of_setup-1f
1:
	//
	// rest of the header
	//

Here we can see the jmp instruction opcode 0xeb followed by the distance to the start_of_setup point (start_of_setup-1f). The Nf notation means the following: 2f refers to the next local 2: label. In our case it is label 1, which goes right after the jump. It is followed by the rest of the setup header, and right after the setup header we can see the .entrytext section, which starts at the start_of_setup label.
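The way the CPU resolves such a two-byte jump can be sketched as follows. The function name `decode_short_jmp` is my own, and the displacement 0x66 in the first call is a made-up value, not one read from a real kernel image.

```python
def decode_short_jmp(code: bytes, address: int) -> int:
    """Return the target of a 2-byte `jmp rel8` located at `address`."""
    if len(code) < 2 or code[0] != 0xEB:
        raise ValueError("not a short jmp")
    # The displacement is a signed 8-bit value...
    rel8 = code[1] - 0x100 if code[1] >= 0x80 else code[1]
    # ...relative to the address of the next instruction (address + 2).
    return (address + 2 + rel8) & 0xFFFF


# Hypothetical displacement: a jump at offset 0x200 that skips 0x66 bytes.
print(hex(decode_short_jmp(bytes([0xEB, 0x66]), 0x200)))   # 0x268

# `jmp $` (0xeb 0xfe) jumps back to itself, here assumed to sit at 0x7c0a.
print(hex(decode_short_jmp(bytes([0xEB, 0xFE]), 0x7C0A)))  # 0x7c0a
```

This is why the displacement byte is written as start_of_setup-1f: label 1 marks the address right after the jump, which is the base the displacement is added to.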

Apart from the jump instruction itself, this is the first code which starts to execute. When the kernel setup gets control from the bootloader, the first jmp instruction is located at offset 0x200 (after the first 512 bytes) from the start of the kernel real-mode code. We can read this in the Linux kernel boot protocol and also see it in the grub2 source code:

  state.gs = state.fs = state.es = state.ds = state.ss = segment;
  state.cs = segment + 0x20;

It means that segment registers will have following values after kernel setup starts to work:

gs = fs = es = ds = ss = 0x1000
cs = 0x1020

in my case, where the kernel is loaded at 0x10000.
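The relationship between the load address and these values can be sketched like this. The function name `setup_segments` is hypothetical; it only mirrors the two grub2 assignments quoted above.

```python
def setup_segments(load_address: int) -> dict:
    """Mirror the grub2 assignments: all data segments equal, cs = segment + 0x20."""
    if load_address % 16:
        raise ValueError("a real-mode segment must start on a 16-byte boundary")
    segment = load_address >> 4
    registers = {name: segment for name in ("ds", "es", "fs", "gs", "ss")}
    registers["cs"] = segment + 0x20
    return registers


registers = setup_segments(0x10000)
print(hex(registers["ds"]), hex(registers["cs"]))  # 0x1000 0x1020

# With ip = 0, cs points 0x200 bytes past the load address, right at _start.
entry = (registers["cs"] << 4) + 0
print(hex(entry))  # 0x10200
```

Adding 0x20 to the segment is the same as adding 0x200 (512) to the physical address, because the segment value is multiplied by 16.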

After jump to start_of_setup, it needs to do the following things:

Be sure that all values of all segment registers are equal
Setup correct stack if needed
Setup bss
Jump to C code at main.c

Let's look at implementation.

Segment registers align

First of all it ensures that the ds and es segment registers point to the same address, and it clears the direction flag with the cld instruction:

	movw	%ds, %ax
	movw	%ax, %es
	cld

As I wrote above, grub2 loads kernel setup code at 0x10000 address and cs at 0x1020 because execution doesn't start from the start of file, but from:

_start:
	.byte 0xeb
	.byte start_of_setup-1f

jump, which is at a 512-byte offset from the 4d 5a. It also needs to align cs from 0x1020 to 0x1000 (that is, from address 0x10200 to 0x10000), like all the other segment registers:

	pushw	%ds
	pushw	$6f
	lretw

This pushes the ds value onto the stack, followed by the address of label 6, and executes the lretw instruction. When lretw executes, it loads the address of label 6 into the instruction pointer register and loads cs with the value of ds. After that, ds and cs have the same value.
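The trick with the far return can be modeled with a plain Python list standing in for the stack. This is only a sketch: `far_return` is my own name and the offset used for label 6 is a made-up value.

```python
def far_return(stack: list) -> tuple:
    """Model lretw: pop the instruction pointer first, then the code segment."""
    ip = stack.pop()
    cs = stack.pop()
    return cs, ip


ds = 0x1000
label_6 = 0x0270        # hypothetical offset of label 6 within the segment

stack = []
stack.append(ds)        # pushw %ds
stack.append(label_6)   # pushw $6f

cs, ip = far_return(stack)
print(hex(cs), hex(ip))  # 0x1000 0x270
```

Because the offset of label 6 is relative to the start of the setup code, execution continues at the same physical instruction, only now addressed through the same segment as ds.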

Stack Setup

Actually, almost all of the setup code is preparation for C language environment in the real mode. The next step is checking of ss register value and making of correct stack if ss is wrong:

	movw	%ss, %dx
	cmpw	%ax, %dx
	movw	%sp, %dx
	je	2f

Generally, there can be 3 different cases:

ss has a valid value 0x1000 (like all other segment registers besides cs)
ss is invalid and CAN_USE_HEAP flag is set (see below)
ss is invalid and CAN_USE_HEAP flag is not set (see below)

Let's look at all of these cases:

ss has a correct value (0x1000). In this case we go to label 2:

2: 	andw	$~3, %dx
	jnz	3f
	movw	$0xfffc, %dx
3:  movw	%ax, %ss
	movzwl %dx, %esp
	sti

Here we can see dx (which contains the sp given by the bootloader) being aligned to 4 bytes and checked for zero. If it is zero, we put 0xfffc (the last 4-byte aligned address within a 64 KB segment) into dx. If it is not zero, we continue to use the sp given by the bootloader (0xf7f4 in my case). After this we put the ax value, which holds the correct segment 0x1000, into ss and set up the correct sp. We now have a correct stack.
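The alignment and the zero check at label 2 amount to the following small computation. It is a sketch with a function name of my own, `fix_stack_pointer`.

```python
def fix_stack_pointer(sp: int) -> int:
    """Align a 16-bit stack pointer down to 4 bytes; fall back to 0xfffc on zero."""
    dx = sp & ~3 & 0xFFFF   # andw $~3, %dx
    if dx == 0:             # jnz 3f is not taken
        dx = 0xFFFC         # movw $0xfffc, %dx
    return dx


print(hex(fix_stack_pointer(0xF7F4)))  # 0xf7f4 (already aligned)
print(hex(fix_stack_pointer(0xF7F6)))  # 0xf7f4 (rounded down)
print(hex(fix_stack_pointer(0x0002)))  # 0xfffc (would be zero after alignment)
```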

In the second case (ss != ds), the code first puts the value of _end (the address of the end of the setup code) into dx. It then checks the loadflags header field with the testb instruction to see whether we can use the heap. loadflags is a bitmask header field which is defined as:

#define LOADED_HIGH	    (1<<0)
#define QUIET_FLAG	    (1<<5)
#define KEEP_SEGMENTS	(1<<6)
#define CAN_USE_HEAP	(1<<7)

And as we can read in the boot protocol:

Field name:	loadflags

  This field is a bitmask.

  Bit 7 (write): CAN_USE_HEAP
	Set this bit to 1 to indicate that the value entered in the
	heap_end_ptr is valid.  If this field is clear, some setup code
	functionality will be disabled.

If the CAN_USE_HEAP bit is set, heap_end_ptr is put into dx (replacing _end) and STACK_SIZE (the minimal stack size, 512 bytes) is added to it. If the addition does not set the carry flag, we jump to label 2 as in the previous case and make a correct stack.

In the last case, when CAN_USE_HEAP is not set, we just use a minimal stack from _end to _end + STACK_SIZE.
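The two "invalid ss" cases can be summarized in one function. This is a sketch under stated assumptions: the name `new_stack_top` and the sample addresses are made up, and the handling of a carry (resetting dx to zero before falling through to label 2) reflects my reading of header.S rather than anything shown in the text above.

```python
CAN_USE_HEAP = 1 << 7
STACK_SIZE = 512  # minimal stack size, as stated above


def new_stack_top(end: int, loadflags: int, heap_end_ptr: int) -> int:
    """Pick the top of a fresh stack when ss was not set up by the bootloader."""
    dx = end                         # start from the end of the setup code
    if loadflags & CAN_USE_HEAP:
        dx = heap_end_ptr            # the heap end is valid, place the stack after it
    dx += STACK_SIZE
    if dx > 0xFFFF:                  # the 16-bit addition carried
        dx = 0                       # assumption: dx is cleared to avoid wraparound
    dx &= ~3                         # label 2: align to 4 bytes
    return dx or 0xFFFC              # label 2: never leave the stack pointer at zero


print(hex(new_stack_top(0x4000, 0, 0)))                  # 0x4200
print(hex(new_stack_top(0x4000, CAN_USE_HEAP, 0x9800)))  # 0x9a00
print(hex(new_stack_top(0x4000, CAN_USE_HEAP, 0xFF00)))  # 0xfffc
```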

BSS Setup

The last two steps that need to happen before we can jump to the main C code, are that we need to set up the BSS area, and check the "magic" signature. Firstly, signature checking:

cmpl	$0x5a5aaa55, setup_sig
jne	setup_bad

This simply consists of comparing the setup_sig against the magic number 0x5a5aaa55. If they are not equal, a fatal error is reported.
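The same comparison can be sketched over a byte buffer. The function name `signature_ok` and the offset of setup_sig used here are hypothetical; only the magic value comes from the text.

```python
import struct

SETUP_SIG = 0x5A5AAA55


def signature_ok(memory: bytes, setup_sig_offset: int) -> bool:
    """Compare the little-endian 32-bit value at the given offset with the magic."""
    (value,) = struct.unpack_from("<I", memory, setup_sig_offset)
    return value == SETUP_SIG


# A made-up buffer with the signature stored at a made-up offset 0x10.
memory = bytearray(32)
memory[0x10:0x14] = b"\x55\xaa\x5a\x5a"

print(signature_ok(bytes(memory), 0x10))  # True
print(signature_ok(bytes(memory), 0x00))  # False
```

On a little-endian machine the magic number is stored as the bytes 55 aa 5a 5a, which is why the buffer is filled in that order.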

But if the magic number matches, knowing we have a set of correct segment registers and a stack, we only need to set up the BSS section before jumping into the C code.

The BSS section is used for storing statically allocated, uninitialized, data. Linux carefully ensures this area of memory is first blanked, using the following code:

	movw	$__bss_start, %di
	movw	$_end+3, %cx
	xorl	%eax, %eax
	subw	%di, %cx
	shrw	$2, %cx
	rep; stosl

First of all the __bss_start address is moved into di, and the _end + 3 address (the +3 rounds up to a multiple of 4 bytes) is moved into cx. The eax register is cleared (using an xor instruction), and the bss section size (cx - di) is calculated and put into cx. Then cx is divided by four (the size of a doubleword), and the stosl instruction is executed repeatedly, storing the value of eax (zero) at the address pointed to by di and automatically increasing di by four (this continues until cx reaches zero). The net effect of this code is that zeros are written to every doubleword in memory from __bss_start to _end.
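The arithmetic behind this loop can be replayed on a bytearray. This is a sketch: the function name `clear_bss` and the addresses 16 and 30 are made up to keep the example small.

```python
def clear_bss(memory: bytearray, bss_start: int, end: int) -> int:
    """Zero memory from bss_start to end in 4-byte steps; return the step count."""
    di = bss_start                      # movw $__bss_start, %di
    cx = (end + 3 - di) & 0xFFFF        # movw $_end+3, %cx ; subw %di, %cx
    cx >>= 2                            # shrw $2, %cx
    for _ in range(cx):                 # rep; stosl
        memory[di:di + 4] = b"\x00\x00\x00\x00"
        di += 4
    return cx


memory = bytearray(b"\xff" * 64)
count = clear_bss(memory, bss_start=16, end=30)
print(count)                 # 4
print(memory[15], memory[32])  # 255 255
print(memory[16:32] == bytes(16))  # True
```

With a 14-byte region the +3 rounds the count up to 4 doublewords, so 16 bytes are cleared, slightly more than the region itself.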

Jump to main

That's all, we have the stack, BSS and now we can jump to the main() C function:

	calll main

The main() function is located in arch/x86/boot/main.c. What will be there? We will see it in the next part.

Conclusion

This is the end of the first part about Linux kernel internals. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue. In the next part we will see the first C code which executes in the Linux kernel setup, the implementation of memory routines such as memset and memcpy, the earlyprintk implementation, early console initialization and much more.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.
