mirror of
https://github.com/0xAX/linux-insides.git
synced 2024-12-22 22:58:08 +00:00
Merge pull request #420 from anastasds/linux-bootstrap-1-edit
Proofread linux-bootstrap-1 and edit to read more smoothly
commit
0ebe9f97ba
@ -4,26 +4,26 @@ Kernel booting process. Part 1.
|
|||||||
From the bootloader to the kernel
|
From the bootloader to the kernel
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
If you have been reading my previous [blog posts](http://0xax.blogspot.com/search/label/asm) then you can see that from some time I have started to get involve in low-level programming. I have written some posts about x86_64 assembly programming for Linux and at the same time I have also started to dive into the Linux source code. I have a great interest in understanding how low-level things work, how programs run on my computer, how are they located in memory, how the kernel manages processes & memory, how the network stack works at a low level and many many other things. So, I decided to write yet another series of posts about the Linux kernel for **x86_64**.
|
If you have been reading my previous [blog posts](http://0xax.blogspot.com/search/label/asm), then you can see that, for some time, I have been getting involved in low-level programming. I have written some posts about x86_64 assembly programming for Linux and, at the same time, I have also started to dive into the Linux source code. I have a great interest in understanding how low-level things work, how programs run on my computer, how they are located in memory, how the kernel manages processes & memory, how the network stack works at a low level, and many many other things. So, I have decided to write yet another series of posts about the Linux kernel for **x86_64**.
|
||||||
|
|
||||||
Note that I'm not a professional kernel hacker and I don't write code for the kernel at work. It's just a hobby. I just like low-level stuff, and it is interesting for me to see how these things work. So if you notice anything confusing, or if you have any questions/remarks, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new). I appreciate it. All posts will also be accessible at [linux-insides](https://github.com/0xAX/linux-insides) and if you find something wrong with my English or the post content, feel free to send a pull request.
|
Note that I'm not a professional kernel hacker and I don't write code for the kernel at work. It's just a hobby. I just like low-level stuff, and it is interesting for me to see how these things work. So if you notice anything confusing, or if you have any questions/remarks, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new). I appreciate it. All posts will also be accessible at [linux-insides](https://github.com/0xAX/linux-insides) and, if you find something wrong with my English or the post content, feel free to send a pull request.
|
||||||
|
|
||||||
|
|
||||||
*Note that this isn't the official documentation, just learning and sharing knowledge.*
|
*Note that this isn't official documentation, just learning and sharing knowledge.*
|
||||||
|
|
||||||
**Required knowledge**
|
**Required knowledge**
|
||||||
|
|
||||||
* Understanding C code
|
* Understanding C code
|
||||||
* Understanding assembly code (AT&T syntax)
|
* Understanding assembly code (AT&T syntax)
|
||||||
|
|
||||||
Anyway, if you just start to learn some tools, I will try to explain some parts during this and the following posts. Alright, this is the end of simple introduction and now we can start to dive into the kernel and low-level stuff.
|
Anyway, if you just start to learn some tools, I will try to explain some parts during this and the following posts. Alright, this is the end of the simple introduction, and now we can start to dive into the kernel and low-level stuff.
|
||||||
|
|
||||||
All code is actually for kernel - 3.18. If there are changes, I will update the posts accordingly.
|
All code is actually for the 3.18 kernel. If there are changes, I will update the posts accordingly.
|
||||||
|
|
||||||
The Magical Power Button, What happens next?
|
The Magical Power Button, What happens next?
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
Although this is a series of posts about the Linux kernel, we will not be starting from the kernel code (at least not in this paragraph). As soon as you press the magical power button on your laptop or desktop computer, it starts working. The motherboard sends a signal to the [power supply](https://en.wikipedia.org/wiki/Power_supply). After receiving the signal, the power supply provides proper amount of electricity to the computer. Once the motherboard receives the [power good signal](https://en.wikipedia.org/wiki/Power_good_signal), it tries to start the CPU. The CPU resets all leftover data in its registers and sets up predefined values for each of them.
|
Although this is a series of posts about the Linux kernel, we will not be starting from the kernel code, at least not in this paragraph. As soon as you press the magical power button on your laptop or desktop computer, it starts working. The motherboard sends a signal to the [power supply](https://en.wikipedia.org/wiki/Power_supply). After receiving the signal, the power supply provides the proper amount of electricity to the computer. Once the motherboard receives the [power good signal](https://en.wikipedia.org/wiki/Power_good_signal), it tries to start the CPU. The CPU resets all leftover data in its registers and sets up predefined values for each of them.
|
||||||
|
|
||||||
|
|
||||||
[80386](https://en.wikipedia.org/wiki/Intel_80386) and later CPUs define the following predefined data in CPU registers after the computer resets:
|
[80386](https://en.wikipedia.org/wiki/Intel_80386) and later CPUs define the following predefined data in CPU registers after the computer resets:
|
||||||
@ -34,31 +34,31 @@ CS selector 0xf000
|
|||||||
CS base 0xffff0000
|
CS base 0xffff0000
|
||||||
```
|
```
|
||||||
|
|
||||||
The processor starts working in [real mode](https://en.wikipedia.org/wiki/Real_mode). Let's back up a little and try to understand memory segmentation in this mode. Real mode is supported on all x86-compatible processors, from the [8086](https://en.wikipedia.org/wiki/Intel_8086) all the way to the modern Intel 64-bit CPUs. The 8086 processor has a 20-bit address bus, which means that it could work with a 0-0x100000 address space (1 megabyte). But it only has 16-bit registers, and with 16-bit registers the maximum address is 2^16 - 1 or 0xffff (64 kilobytes). [Memory segmentation](http://en.wikipedia.org/wiki/Memory_segmentation) is used to make use of all the address space available. All memory is divided into small, fixed-size segments of 65536 bytes, or 64 KB. Since we cannot address memory above 64 KB with 16 bit registers, an alternate method is devised. An address consists of two parts: a segment selector which has an associated base address and an offset from this base address. In real mode, the associated base address of a segment selector is `Segment Selector * 16`. Thus, to get a physical address in memory, we need to multiply the segment selector part by 16 and add the offset part:
|
The processor starts working in [real mode](https://en.wikipedia.org/wiki/Real_mode). Let's back up a little and try to understand memory segmentation in this mode. Real mode is supported on all x86-compatible processors, from the [8086](https://en.wikipedia.org/wiki/Intel_8086) all the way to the modern Intel 64-bit CPUs. The 8086 processor has a 20-bit address bus, which means that it could work with a 0-0xFFFFF address space (1 megabyte). But it only has 16-bit registers, which have a maximum address of 2^16 - 1 or 0xffff (64 kilobytes). [Memory segmentation](http://en.wikipedia.org/wiki/Memory_segmentation) is used to make use of all the address space available. All memory is divided into small, fixed-size segments of 65536 bytes (64 KB). Since we cannot address memory above 64 KB with 16-bit registers, an alternate method is devised. An address consists of two parts: a segment selector, which has a base address, and an offset from this base address. In real mode, the associated base address of a segment selector is `Segment Selector * 16`. Thus, to get a physical address in memory, we need to multiply the segment selector part by 16 and add the offset:
|
||||||
|
|
||||||
```
|
```
|
||||||
PhysicalAddress = Segment Selector * 16 + Offset
|
PhysicalAddress = Segment Selector * 16 + Offset
|
||||||
```
|
```
|
||||||
|
|
||||||
For example if `CS:IP` is `0x2000:0x0010`, the corresponding physical address will be:
|
For example, if `CS:IP` is `0x2000:0x0010`, then the corresponding physical address will be:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
>>> hex((0x2000 << 4) + 0x0010)
|
>>> hex((0x2000 << 4) + 0x0010)
|
||||||
'0x20010'
|
'0x20010'
|
||||||
```
|
```
|
||||||
|
|
||||||
But if we take the largest segment selector and offset: `0xffff:0xffff`, it will be:
|
But, if we take the largest segment selector and offset, `0xffff:0xffff`, then the resulting address will be:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
>>> hex((0xffff << 4) + 0xffff)
|
>>> hex((0xffff << 4) + 0xffff)
|
||||||
'0x10ffef'
|
'0x10ffef'
|
||||||
```
|
```
|
||||||
|
|
||||||
which is 65520 bytes over first megabyte. Since only one megabyte is accessible in real mode, `0x10ffef` becomes `0x00ffef` with disabled [A20](https://en.wikipedia.org/wiki/A20_line).
|
which is 65520 bytes past the first megabyte. Since only one megabyte is accessible in real mode, `0x10ffef` becomes `0x00ffef` with disabled [A20](https://en.wikipedia.org/wiki/A20_line).
|
||||||
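The wraparound described above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function names `real_mode_address` and `wrap_without_a20` are made up for illustration, and masking to 20 bits is a simplified model of what a disabled A20 line does.

```python
ONE_MB = 1 << 20  # 0x100000, the size of the 20-bit address space

def real_mode_address(segment, offset):
    """Physical address for a real mode segment:offset pair."""
    return (segment << 4) + offset

def wrap_without_a20(address):
    """Drop bit 20 and above, as happens when the A20 line is disabled."""
    return address & (ONE_MB - 1)

top = real_mode_address(0xffff, 0xffff)
print(hex(top))                    # 0x10ffef
print(top - ONE_MB + 1)            # 65520 bytes lie at or above 1 MB
print(hex(wrap_without_a20(top)))  # 0xffef
```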
|
|
||||||
Ok, now we know about real mode and memory addressing. Let's get back to discuss about register values after reset:
|
Ok, now we know about real mode and memory addressing. Let's get back to discussing register values after reset:
|
||||||
|
|
||||||
The `CS` register consists of two parts: the visible segment selector and the hidden base address. While the base address is normally formed by multiplying the segment selector value by 16, during a hardware reset, the segment selector in the CS register is loaded with 0xf000 and the base address is loaded with 0xffff0000. The processor uses this special base address until CS is changed.
|
The `CS` register consists of two parts: the visible segment selector, and the hidden base address. While the base address is normally formed by multiplying the segment selector value by 16, during a hardware reset the segment selector in the CS register is loaded with 0xf000 and the base address is loaded with 0xffff0000; the processor uses this special base address until `CS` is changed.
|
||||||
|
|
||||||
The starting address is formed by adding the base address to the value in the EIP register:
|
The starting address is formed by adding the base address to the value in the EIP register:
|
||||||
|
|
||||||
@ -67,7 +67,7 @@ The starting address is formed by adding the base address to the value in the EI
|
|||||||
'0xfffffff0'
|
'0xfffffff0'
|
||||||
```
|
```
|
||||||
|
|
||||||
We get `0xfffffff0` which is 4GB - 16 bytes. This point is called the [Reset vector](http://en.wikipedia.org/wiki/Reset_vector). This is the memory location at which the CPU expects to find the first instruction to execute after reset. It contains a [jump](http://en.wikipedia.org/wiki/JMP_%28x86_instruction%29) instruction which usually points to the BIOS entry point. For example, if we look in the [coreboot](http://www.coreboot.org/) source code, we see:
|
We get `0xfffffff0`, which is 16 bytes below 4GB. This point is called the [Reset vector](http://en.wikipedia.org/wiki/Reset_vector). This is the memory location at which the CPU expects to find the first instruction to execute after reset. It contains a [jump](http://en.wikipedia.org/wiki/JMP_%28x86_instruction%29) (`jmp`) instruction that usually points to the BIOS entry point. For example, if we look in the [coreboot](http://www.coreboot.org/) source code, we see:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
.section ".reset"
|
.section ".reset"
|
||||||
@ -79,7 +79,7 @@ reset_vector:
|
|||||||
...
|
...
|
||||||
```
|
```
|
||||||
|
|
||||||
Here we can see the jmp instruction [opcode](http://ref.x86asm.net/coder32.html#xE9) - 0xe9 and its destination address - `_start - ( . + 2)`, and we can see that the `reset` section is 16 bytes and starts at `0xfffffff0`:
|
Here we can see the `jmp` instruction [opcode](http://ref.x86asm.net/coder32.html#xE9), which is 0xe9, and its destination address at `_start - ( . + 2)`. We can also see that the `reset` section is 16 bytes, and that it starts at `0xfffffff0`:
|
||||||
|
|
||||||
```
|
```
|
||||||
SECTIONS {
|
SECTIONS {
|
||||||
@ -93,7 +93,7 @@ SECTIONS {
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
Now the BIOS starts: after initializing and checking the hardware, it needs to find a bootable device. A boot order is stored in the BIOS configuration, controlling which devices the BIOS attempts to boot from. When attempting to boot from a hard drive, the BIOS tries to find a boot sector. On hard drives partitioned with an MBR partition layout, the boot sector is stored in the first 446 bytes of the first sector (which is 512 bytes). The final two bytes of the first sector are `0x55` and `0xaa`, which signals the BIOS that this device is bootable. For example:
|
Now the BIOS starts; after initializing and checking the hardware, the BIOS needs to find a bootable device. A boot order is stored in the BIOS configuration, controlling which devices the BIOS attempts to boot from. When attempting to boot from a hard drive, the BIOS tries to find a boot sector. On hard drives partitioned with an MBR partition layout, the boot sector is stored in the first 446 bytes of the first sector, where each sector is 512 bytes. The final two bytes of the first sector are `0x55` and `0xaa`, which signals to the BIOS that this device is bootable. For example:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
;
|
;
|
||||||
@ -117,45 +117,45 @@ db 0x55
|
|||||||
db 0xaa
|
db 0xaa
|
||||||
```
|
```
|
||||||
|
|
||||||
Build and run it with:
|
Build and run this with:
|
||||||
|
|
||||||
```
|
```
|
||||||
nasm -f bin boot.nasm && qemu-system-x86_64 boot
|
nasm -f bin boot.nasm && qemu-system-x86_64 boot
|
||||||
```
|
```
|
||||||
|
|
||||||
This will instruct [QEMU](http://qemu.org) to use the `boot` binary we just built as a disk image. Since the binary generated by the assembly code above fulfills the requirements of the boot sector (the origin is set to `0x7c00`, and we end with the magic sequence), QEMU will treat the binary as the master boot record (MBR) of a disk image.
|
This will instruct [QEMU](http://qemu.org) to use the `boot` binary that we just built as a disk image. Since the binary generated by the assembly code above fulfills the requirements of the boot sector (the origin is set to `0x7c00` and we end with the magic sequence), QEMU will treat the binary as the master boot record (MBR) of a disk image.
|
||||||
|
|
||||||
You will see:
|
You will see:
|
||||||
|
|
||||||
![Simple bootloader which prints only `!`](http://oi60.tinypic.com/2qbwup0.jpg)
|
![Simple bootloader which prints only `!`](http://oi60.tinypic.com/2qbwup0.jpg)
|
||||||
|
|
||||||
In this example we can see that the code will be executed in 16 bit real mode and will start at 0x7c00 in memory. After starting it calls the [0x10](http://www.ctyme.com/intr/rb-0106.htm) interrupt which just prints the `!` symbol. It fills the rest of the 510 bytes with zeros and finishes with the two magic bytes `0xaa` and `0x55`.
|
In this example, we can see that the code will be executed in 16-bit real mode and will start at `0x7c00` in memory. After starting, it calls the [0x10](http://www.ctyme.com/intr/rb-0106.htm) interrupt, which just prints the `!` symbol; it pads the rest of the first 510 bytes with zeros and finishes with the two magic bytes `0x55` and `0xaa`.
|
||||||
|
|
||||||
You can see a binary dump of this with the `objdump` util:
|
You can see a binary dump of this using the `objdump` utility:
|
||||||
|
|
||||||
```
|
```
|
||||||
nasm -f bin boot.nasm
|
nasm -f bin boot.nasm
|
||||||
objdump -D -b binary -mi386 -Maddr16,data16,intel boot
|
objdump -D -b binary -mi386 -Maddr16,data16,intel boot
|
||||||
```
|
```
|
||||||
|
|
||||||
A real-world boot sector has code to continue the boot process and the partition table instead of a bunch of 0's and an exclamation mark :) From this point onwards, BIOS hands over control to the bootloader.
|
A real-world boot sector has code for continuing the boot process and a partition table instead of a bunch of 0's and an exclamation mark :) From this point onwards, the BIOS hands over control to the bootloader.
|
||||||
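The signature test that the BIOS applies to the first sector can be sketched in Python. This is an illustration only: `has_boot_signature` is a hypothetical helper, and the buffer below is a stand-in for the assembled `boot` file rather than its real contents.

```python
SECTOR_SIZE = 512

def has_boot_signature(sector):
    """Return True if a 512-byte sector ends with the bytes 0x55, 0xaa."""
    return len(sector) == SECTOR_SIZE and sector[510:512] == b"\x55\xaa"

# A stand-in for the assembled `boot` file: zero padding plus the signature.
fake_sector = bytes(510) + b"\x55\xaa"
print(has_boot_signature(fake_sector))         # True
print(has_boot_signature(bytes(SECTOR_SIZE)))  # False
```

To try it on the real binary, read the first 512 bytes of the file with `open("boot", "rb").read(512)` and pass them to the same function.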
|
|
||||||
**NOTE**: As you can read above, the CPU is in real mode. In real mode, calculating the physical address in memory is done as follows:
|
**NOTE**: As explained above, the CPU is in real mode; in real mode, calculating the physical address in memory is done as follows:
|
||||||
|
|
||||||
```
|
```
|
||||||
PhysicalAddress = Segment Selector * 16 + Offset
|
PhysicalAddress = Segment Selector * 16 + Offset
|
||||||
```
|
```
|
||||||
|
|
||||||
The same as mentioned before. We have only 16 bit general purpose registers, the maximum value of a 16 bit register is `0xffff`, so if we take the largest values, the result will be:
|
just as explained before. We have only 16 bit general purpose registers; the maximum value of a 16 bit register is `0xffff`, so if we take the largest values, the result will be:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
>>> hex((0xffff * 16) + 0xffff)
|
>>> hex((0xffff * 16) + 0xffff)
|
||||||
'0x10ffef'
|
'0x10ffef'
|
||||||
```
|
```
|
||||||
|
|
||||||
Where `0x10ffef` is equal to `1MB + 64KB - 16b`. But a [8086](https://en.wikipedia.org/wiki/Intel_8086) processor, which is the first processor with real mode, has a 20 bit address line and `2^20 = 1048576` is 1MB. This means the actual memory available is 1MB.
|
where `0x10ffef` is the last byte of a `1MB + 64KB - 16` byte address range. An [8086](https://en.wikipedia.org/wiki/Intel_8086) processor (which was the first processor with real mode), in contrast, has a 20 bit address line. Since `2^20 = 1048576` is 1MB, this means that the actual available memory is 1MB.
|
||||||
|
|
||||||
General real mode's memory map is:
|
The general real mode memory map is as follows:
|
||||||
|
|
||||||
```
|
```
|
||||||
0x00000000 - 0x000003FF - Real Mode Interrupt Vector Table
|
0x00000000 - 0x000003FF - Real Mode Interrupt Vector Table
|
||||||
@ -171,7 +171,7 @@ General real mode's memory map is:
|
|||||||
0x000F0000 - 0x000FFFFF - System BIOS
|
0x000F0000 - 0x000FFFFF - System BIOS
|
||||||
```
|
```
|
||||||
|
|
||||||
In the beginning of this post I wrote that the first instruction executed by the CPU is located at address `0xFFFFFFF0`, which is much larger than `0xFFFFF` (1MB). How can the CPU access this in real mode? This is in the [coreboot](http://www.coreboot.org/Developer_Manual/Memory_map) documentation:
|
In the beginning of this post, I wrote that the first instruction executed by the CPU is located at address `0xFFFFFFF0`, which is much larger than `0xFFFFF` (1MB). How can the CPU access this address in real mode? The answer is in the [coreboot](http://www.coreboot.org/Developer_Manual/Memory_map) documentation:
|
||||||
|
|
||||||
```
|
```
|
||||||
0xFFFE_0000 - 0xFFFF_FFFF: 128 kilobyte ROM mapped into address space
|
0xFFFE_0000 - 0xFFFF_FFFF: 128 kilobyte ROM mapped into address space
|
||||||
@ -182,13 +182,13 @@ At the start of execution, the BIOS is not in RAM, but in ROM.
|
|||||||
Bootloader
|
Bootloader
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
There are a number of bootloaders that can boot Linux, such as [GRUB 2](https://www.gnu.org/software/grub/) and [syslinux](http://www.syslinux.org/wiki/index.php/The_Syslinux_Project). The Linux kernel has a [Boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt) which specifies the requirements for bootloaders to implement Linux support. This example will describe GRUB 2.
|
There are a number of bootloaders that can boot Linux, such as [GRUB 2](https://www.gnu.org/software/grub/) and [syslinux](http://www.syslinux.org/wiki/index.php/The_Syslinux_Project). The Linux kernel has a [Boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt) which specifies the requirements for a bootloader to implement Linux support. This example will describe GRUB 2.
|
||||||
|
|
||||||
Now that the BIOS has chosen a boot device and transferred control to the boot sector code, execution starts from [boot.img](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/boot/i386/pc/boot.S;hb=HEAD). This code is very simple due to the limited amount of space available, and contains a pointer which is used to jump to the location of GRUB 2's core image. The core image begins with [diskboot.img](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/boot/i386/pc/diskboot.S;hb=HEAD), which is usually stored immediately after the first sector in the unused space before the first partition. The above code loads the rest of the core image into memory, which contains GRUB 2's kernel and drivers for handling filesystems. After loading the rest of the core image, it executes [grub_main](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/kern/main.c).
|
Continuing from before, now that the BIOS has chosen a boot device and transferred control to the boot sector code, execution starts from [boot.img](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/boot/i386/pc/boot.S;hb=HEAD). This code is very simple, due to the limited amount of space available, and contains a pointer which is used to jump to the location of GRUB 2's core image. The core image begins with [diskboot.img](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/boot/i386/pc/diskboot.S;hb=HEAD), which is usually stored immediately after the first sector in the unused space before the first partition. The above code loads the rest of the core image, which contains GRUB 2's kernel and drivers for handling filesystems, into memory. After loading the rest of the core image, it executes [grub_main](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/kern/main.c).
|
||||||
|
|
||||||
`grub_main` initializes the console, gets the base address for modules, sets the root device, loads/parses the grub configuration file, loads modules etc. At the end of execution, `grub_main` moves grub to normal mode. `grub_normal_execute` (from `grub-core/normal/main.c`) completes the last preparation and shows a menu to select an operating system. When we select one of the grub menu entries, `grub_menu_execute_entry` runs, which executes the grub `boot` command, booting the selected operating system.
|
`grub_main` initializes the console, gets the base address for modules, sets the root device, loads/parses the grub configuration file, loads modules, etc. At the end of execution, `grub_main` moves grub to normal mode. `grub_normal_execute` (from `grub-core/normal/main.c`) completes the final preparations and shows a menu to select an operating system. When we select one of the grub menu entries, `grub_menu_execute_entry` runs, executing the grub `boot` command and booting the selected operating system.
|
||||||
|
|
||||||
As we can read in the kernel boot protocol, the bootloader must read and fill some fields of the kernel setup header, which starts at `0x01f1` offset from the kernel setup code. The kernel header [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) starts from:
|
As we can read in the kernel boot protocol, the bootloader must read and fill some fields of the kernel setup header, which starts at the `0x01f1` offset from the kernel setup code. The kernel header [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) starts from:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
.globl hdr
|
.globl hdr
|
||||||
@ -202,7 +202,7 @@ hdr:
|
|||||||
boot_flag: .word 0xAA55
|
boot_flag: .word 0xAA55
|
||||||
```
|
```
|
||||||
|
|
||||||
The bootloader must fill this and the rest of the headers (only marked as `write` in the Linux boot protocol, for example [this](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L354)) with values which it either got from command line or calculated. We will not see a description and explanation of all fields of the kernel setup header, we will get back to that when the kernel uses them. You can find a description of all fields in the [boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L156).
|
The bootloader must fill this and the rest of the headers (which are only marked as being type `write` in the Linux boot protocol, such as in [this example](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L354)) with values which it has either received from the command line or calculated. (We will not go over full descriptions and explanations for all fields of the kernel setup header now, but instead when we discuss how the kernel uses them; you can find a description of all fields in the [boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L156).)
|
||||||
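To make the header layout a bit more concrete, here is a small Python sketch that reads the `boot_flag` field out of a buffer. The offset `0x1fe` is taken from the field table in the boot protocol and is an assumption as far as this post is concerned; `read_boot_flag` and the synthetic buffer are hypothetical and only stand in for a real kernel image.

```python
import struct

SETUP_HEADER_OFFSET = 0x1f1  # start of the setup header, as stated above
BOOT_FLAG_OFFSET = 0x1fe     # assumption: boot_flag offset from the boot protocol table

def read_boot_flag(image):
    """Read the little-endian 16-bit boot_flag field from a kernel image."""
    (flag,) = struct.unpack_from("<H", image, BOOT_FLAG_OFFSET)
    return flag

# Synthetic 512-byte buffer standing in for the first sector of a kernel image.
fake_image = bytearray(512)
struct.pack_into("<H", fake_image, BOOT_FLAG_OFFSET, 0xAA55)
print(hex(read_boot_flag(fake_image)))  # 0xaa55
```

Because the value is stored little-endian, the two bytes in the buffer are `0x55` followed by `0xaa`, the same magic sequence that ends the boot sector shown earlier.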
|
|
||||||
As we can see in the kernel boot protocol, the memory map will be the following after loading the kernel:
|
As we can see in the kernel boot protocol, the memory map will be the following after loading the kernel:
|
||||||
|
|
||||||
@ -231,34 +231,34 @@ X+08000 +------------------------+
|
|||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
So when the bootloader transfers control to the kernel, it starts at:
|
So, when the bootloader transfers control to the kernel, it starts at:
|
||||||
|
|
||||||
```
|
```
|
||||||
0x1000 + X + sizeof(KernelBootSector) + 1
|
0x1000 + X + sizeof(KernelBootSector) + 1
|
||||||
```
|
```
|
||||||
|
|
||||||
where `X` is the address of the kernel boot sector loaded. In my case `X` is `0x10000`, as we can see in a memory dump:
|
where `X` is the address of the kernel boot sector being loaded. In my case, `X` is `0x10000`, as we can see in a memory dump:
|
||||||
|
|
||||||
![kernel first address](http://oi57.tinypic.com/16bkco2.jpg)
|
![kernel first address](http://oi57.tinypic.com/16bkco2.jpg)
|
||||||
|
|
||||||
The bootloader has now loaded the Linux kernel into memory, filled the header fields and jumped to it. Now we can move directly to the kernel setup code.
|
The bootloader has now loaded the Linux kernel into memory, filled the header fields, and then jumped to the corresponding memory address. We can now move directly to the kernel setup code.
|
||||||
|
|
||||||
Start of Kernel Setup
|
Start of Kernel Setup
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
Finally we are in the kernel. Technically the kernel hasn't run yet, firstly we need to set up the kernel, memory manager, process manager etc. Kernel setup execution starts from [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) at [_start](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L293). It is a little strange at first sight, as there are several instructions before it.
|
Finally, we are in the kernel! Technically, the kernel hasn't run yet; first, we need to set up the kernel, memory manager, process manager, etc. Kernel setup execution starts from [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) at [_start](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L293). It is a little strange at first sight, as there are several instructions before it.
|
||||||
|
|
||||||
A Long time ago, the Linux kernel used to have its own bootloader but now if you run(for example):
|
A long time ago, the Linux kernel used to have its own bootloader. Now, however, if you run, for example,
|
||||||
|
|
||||||
```
|
```
|
||||||
qemu-system-x86_64 vmlinuz-3.18-generic
|
qemu-system-x86_64 vmlinuz-3.18-generic
|
||||||
```
|
```
|
||||||
|
|
||||||
You will see:
|
then you will see:
|
||||||
|
|
||||||
![Try vmlinuz in qemu](http://oi60.tinypic.com/r02xkz.jpg)
|
![Try vmlinuz in qemu](http://oi60.tinypic.com/r02xkz.jpg)
|
||||||
|
|
||||||
Actually `header.S` starts from [MZ](https://en.wikipedia.org/wiki/DOS_MZ_executable) (see image above), error message printing and following [PE](https://en.wikipedia.org/wiki/Portable_Executable) header:
|
Actually, `header.S` starts with the [MZ](https://en.wikipedia.org/wiki/DOS_MZ_executable) magic number (see image above), followed by the error message printing code and the [PE](https://en.wikipedia.org/wiki/Portable_Executable) header:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
#ifdef CONFIG_EFI_STUB
|
#ifdef CONFIG_EFI_STUB
|
||||||
@ -274,9 +274,9 @@ pe_header:
|
|||||||
.word 0
|
.word 0
|
||||||
```
|
```
|
||||||
|
|
||||||
It needs this to load an operating system with [UEFI](https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface). We won't be looking into its working right now, we'll cover it in upcoming chapters.
|
It needs this to load an operating system with [UEFI](https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface). We won't be looking into its inner workings right now and will cover it in upcoming chapters.
|
||||||
|
|
||||||
So the actual kernel setup entry point is:
|
The actual kernel setup entry point is:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
// header.S line 292
|
// header.S line 292
|
||||||
@ -284,7 +284,7 @@ So the actual kernel setup entry point is:
|
|||||||
_start:
|
_start:
|
||||||
```
|
```
|
||||||
|
|
||||||
The bootloader (grub2 and others) knows about this point (`0x200` offset from `MZ`) and makes a jump directly to this point, despite the fact that `header.S` starts from `.bstext` section which prints an error message:
|
The bootloader (grub2 and others) knows about this point (`0x200` offset from `MZ`) and makes a jump directly to it, despite the fact that `header.S` starts from the `.bstext` section, which prints an error message:
|
||||||
|
|
||||||
```
|
```
|
||||||
//
|
//
|
||||||
@ -295,7 +295,7 @@ The bootloader (grub2 and others) knows about this point (`0x200` offset from `M
|
|||||||
.bsdata : { *(.bsdata) }
|
.bsdata : { *(.bsdata) }
|
||||||
```
|
```
|
||||||
|
|
||||||
So the kernel setup entry point is:
|
The kernel setup entry point is:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
.globl _start
|
.globl _start
|
||||||
@ -308,9 +308,9 @@ _start:
|
|||||||
//
|
//
|
||||||
```
|
```
|
||||||
|
|
||||||
Here we can see a `jmp` instruction opcode - `0xeb` to the `start_of_setup-1f` point. `Nf` notation means `2f` refers to the next local `2:` label. In our case it is label `1` which goes right after jump. It contains the rest of the setup [header](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L156). Right after the setup header we see the `.entrytext` section which starts at the `start_of_setup` label.
|
Here we can see a `jmp` instruction opcode (`0xeb`) followed by its one-byte displacement, `start_of_setup-1f`. In `Nf` notation, `2f` refers to the next local `2:` label; in our case, it is label `1`, which comes right after the jump and contains the rest of the setup [header](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L156). Right after the setup header, we see the `.entrytext` section, which starts at the `start_of_setup` label.
|
||||||
|
|
||||||
Actually this is the first code that runs (aside from the previous jump instruction of course). After the kernel setup got the control from the bootloader, the first `jmp` instruction is located at `0x200` (first 512 bytes) offset from the start of the kernel real mode. This we can read in the Linux kernel boot protocol and also see in the grub2 source code:
|
This is the first code that actually runs (aside from the previous jump instructions, of course). When the kernel setup receives control from the bootloader, the first `jmp` instruction is located at offset `0x200` from the start of the kernel's real-mode code, i.e., after the first 512 bytes. We can both read this in the Linux kernel boot protocol and see it in the grub2 source code:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
segment = grub_linux_real_target >> 4;
|
segment = grub_linux_real_target >> 4;
|
||||||
@ -318,28 +318,28 @@ state.gs = state.fs = state.es = state.ds = state.ss = segment;
|
|||||||
state.cs = segment + 0x20;
|
state.cs = segment + 0x20;
|
||||||
```
|
```
|
||||||
|
|
||||||
It means that segment registers will have the following values after kernel setup starts:
|
This means that segment registers will have the following values after kernel setup starts:
|
||||||
|
|
||||||
```
|
```
|
||||||
gs = fs = es = ds = ss = 0x1000
|
gs = fs = es = ds = ss = 0x1000
|
||||||
cs = 0x1020
|
cs = 0x1020
|
||||||
```
|
```
|
||||||
|
|
||||||
In my case when the kernel is loaded at `0x10000`.
|
In my case, the kernel is loaded at `0x10000`.
|
||||||
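The arithmetic behind these register values can be sketched in C. This is only an illustration and not grub2 code: the function names are made up for the example, and `0x10000` is simply the load address mentioned above.

```c
#include <stdint.h>

/* Real-mode linear address: the segment is shifted left by 4, then the offset is added. */
uint32_t linear_address(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Segment for the data registers, mirroring `segment = grub_linux_real_target >> 4`. */
uint16_t data_segment(uint32_t real_target)
{
    return (uint16_t)(real_target >> 4);
}

/* Code segment, mirroring `state.cs = segment + 0x20`. */
uint16_t code_segment(uint32_t real_target)
{
    return (uint16_t)(data_segment(real_target) + 0x20);
}
```

With a load address of `0x10000`, `data_segment` gives `0x1000` and `code_segment` gives `0x1020`. `linear_address(0x1020, 0)` is `0x10200`, which is `0x200` bytes past the load address, exactly where `_start` sits.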
|
|
||||||
After the jump to `start_of_setup`, it needs to do the following:
|
After the jump to `start_of_setup`, the kernel needs to do the following:
|
||||||
|
|
||||||
* Be sure that all values of all segment registers are equal
|
* Make sure that all segment register values are equal
|
||||||
* Set up correct stack if needed
|
* Set up a correct stack, if needed
|
||||||
* Set up [bss](https://en.wikipedia.org/wiki/.bss)
|
* Set up [bss](https://en.wikipedia.org/wiki/.bss)
|
||||||
* Jump to C code at [main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c)
|
* Jump to the C code in [main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c)
|
||||||
|
|
||||||
Let's look at the implementation.
|
Let's look at the implementation.
|
||||||
|
|
||||||
Segment registers align
|
Segment registers align
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
First of all it ensures that `ds` and `es` segment registers point to the same address and clears the direction flag with the `cld` instruction:
|
First of all, the kernel ensures that `ds` and `es` segment registers point to the same address. Next, it clears the direction flag using the `cld` instruction:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
movw %ds, %ax
|
movw %ds, %ax
|
||||||
@ -347,7 +347,7 @@ First of all it ensures that `ds` and `es` segment registers point to the same a
|
|||||||
cld
|
cld
|
||||||
```
|
```
|
||||||
|
|
||||||
As I wrote earlier, grub2 loads kernel setup code at address `0x10000` and `cs` at `0x1020` because execution doesn't start from the start of file, but from:
|
As I wrote earlier, grub2 loads kernel setup code at address `0x10000` and `cs` at `0x1020` because execution doesn't start from the start of file, but from
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
_start:
|
_start:
|
||||||
@ -355,7 +355,7 @@ _start:
|
|||||||
.byte start_of_setup-1f
|
.byte start_of_setup-1f
|
||||||
```
|
```
|
||||||
|
|
||||||
`jump`, which is at 512 bytes offset from the [4d 5a](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L47). It also needs to align `cs` from `0x10200` to `0x10000` as all other segment registers. After that we set up the stack:
|
`jump`, which is at a 512-byte offset from [4d 5a](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L47). The kernel also needs to align `cs` (`0x1020`, i.e. base address `0x10200`) with all the other segment registers (`0x1000`, i.e. base address `0x10000`). It does this as follows:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
pushw %ds
|
pushw %ds
|
||||||
@ -363,12 +363,12 @@ _start:
|
|||||||
lretw
|
lretw
|
||||||
```
|
```
|
||||||
|
|
||||||
push `ds` value to the stack with the address of the [6](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L494) label and execute `lretw` instruction. When we call `lretw`, it loads address of label `6` into the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter) register and `cs` with the value of `ds`. After this `ds` and `cs` will have the same values.
|
This pushes the value of `ds` onto the stack, followed by the address of the [6](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L494) label, and then executes the `lretw` instruction. When `lretw` executes, it pops the address of label `6` into the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter) register and pops the saved `ds` value into `cs`. Afterwards, `ds` and `cs` have the same value.
|
||||||
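To make the effect of the far return easier to follow, here is a small C model of the three instructions. It is a sketch under simplifying assumptions: the `struct cpu` type, the helper names, and the tiny array used as a stack are all hypothetical, and only the registers that matter here are modelled.

```c
#include <stdint.h>

/* Hypothetical, minimal model of the registers involved; not kernel code. */
struct cpu {
    uint16_t cs, ds, ip;
    uint16_t stack[8];   /* tiny stand-in for the real-mode stack */
    int top;             /* number of words currently on the stack */
};

static void pushw(struct cpu *c, uint16_t value)
{
    c->stack[c->top++] = value;
}

static uint16_t popw(struct cpu *c)
{
    return c->stack[--c->top];
}

/* A 16-bit far return pops the instruction pointer first, then cs. */
static void lretw(struct cpu *c)
{
    c->ip = popw(c);
    c->cs = popw(c);
}

/* Mirrors: pushw %ds; pushw $6f; lretw */
void align_cs_with_ds(struct cpu *c, uint16_t label_6)
{
    pushw(c, c->ds);
    pushw(c, label_6);
    lretw(c);
}
```

Starting with `cs = 0x1020` and `ds = 0x1000`, a call to `align_cs_with_ds` leaves `cs` equal to `0x1000`, sets `ip` to the offset passed in as `label_6`, and leaves the model stack empty again.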
|
|
||||||
Stack Setup
|
Stack Setup
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
Actually, almost all of the setup code is preparation for the C language environment in real mode. The next [step](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L467) is checking the `ss` register value and making a correct stack if `ss` is wrong:
|
Almost all of the setup code is in preparation for the C language environment in real mode. The next [step](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L467) is checking the `ss` register value and making a correct stack if `ss` is wrong:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
movw %ss, %dx
|
movw %ss, %dx
|
||||||
@ -379,13 +379,13 @@ Actually, almost all of the setup code is preparation for the C language environ
|
|||||||
|
|
||||||
This can lead to 3 different scenarios:
|
This can lead to 3 different scenarios:
|
||||||
|
|
||||||
* `ss` has valid value 0x10000 (as all other segment registers beside `cs`)
|
* `ss` has a valid value, i.e. a segment base of 0x10000 (as do all other segment registers besides `cs`)
|
||||||
* `ss` is invalid and `CAN_USE_HEAP` flag is set (see below)
|
* `ss` is invalid and `CAN_USE_HEAP` flag is set (see below)
|
||||||
* `ss` is invalid and `CAN_USE_HEAP` flag is not set (see below)
|
* `ss` is invalid and `CAN_USE_HEAP` flag is not set (see below)
|
||||||
|
|
||||||
Let's look at all three of these scenarios:
|
Let's look at all three of these scenarios in turn:
|
||||||
|
|
||||||
* `ss` has a correct address (0x10000). In this case we go to label [2](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L481):
|
* `ss` has a correct address (0x10000). In this case, we go to label [2](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L481):
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
2: andw $~3, %dx
|
2: andw $~3, %dx
|
||||||
@ -396,11 +396,11 @@ Let's look at all three of these scenarios:
|
|||||||
sti
|
sti
|
||||||
```
|
```
|
||||||
|
|
||||||
Here we can see the alignment of `dx` (contains `sp` given by bootloader) to 4 bytes and a check for whether or not it is zero. If it is zero, we put `0xfffc` (4 byte aligned address before maximum segment size - 64 KB) in `dx`. If it is not zero we continue to use `sp` given by the bootloader (0xf7f4 in my case). After this we put the `ax` value to `ss` which stores the correct segment address of `0x10000` and sets up a correct `sp`. We now have a correct stack:
|
Here we can see the alignment of `dx` (which contains the `sp` value given by the bootloader) to 4 bytes and a check for whether or not it is zero. If it is zero, we put `0xfffc` (the last 4-byte-aligned address within the maximum segment size of 64 KB) into `dx`. If it is not zero, we continue to use the `sp` given by the bootloader (0xf7f4 in my case). After this, we put the `ax` value into `ss`, which gives `ss` the correct segment address of `0x10000`, and set up a correct `sp`. We now have a correct stack:
|
||||||
|
|
||||||
![stack](http://oi58.tinypic.com/16iwcis.jpg)
|
![stack](http://oi58.tinypic.com/16iwcis.jpg)
|
||||||
|
|
||||||
* In the second scenario, (`ss` != `ds`). First of all put the [_end](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L52) (address of end of setup code) value in `dx` and check the `loadflags` header field with the `testb` instruction to see whether we can use the heap or not. [loadflags](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L321) is a bitmask header which is defined as:
|
* In the second scenario (`ss` != `ds`), we first put the value of [_end](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L52) (the address of the end of the setup code) into `dx` and check the `loadflags` header field using the `testb` instruction to see whether we can use the heap. [loadflags](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L321) is a bitmask header field whose flags are defined as:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
#define LOADED_HIGH (1<<0)
|
#define LOADED_HIGH (1<<0)
|
||||||
@ -409,7 +409,7 @@ Here we can see the alignment of `dx` (contains `sp` given by bootloader) to 4 b
|
|||||||
#define CAN_USE_HEAP (1<<7)
|
#define CAN_USE_HEAP (1<<7)
|
||||||
```
|
```
|
||||||
|
|
||||||
And as we can read in the boot protocol:
|
and, as we can read in the boot protocol,
|
||||||
|
|
||||||
```
|
```
|
||||||
Field name: loadflags
|
Field name: loadflags
|
||||||
@ -422,7 +422,7 @@ Field name: loadflags
|
|||||||
functionality will be disabled.
|
functionality will be disabled.
|
||||||
```
|
```
|
||||||
|
|
||||||
If the `CAN_USE_HEAP` bit is set, put `heap_end_ptr` in `dx` which points to `_end` and add `STACK_SIZE` (minimal stack size - 512 bytes) to it. After this if `dx` is not carry (it will not be carry, dx = _end + 512), jump to label `2` as in the previous case and make a correct stack.
|
If the `CAN_USE_HEAP` bit is set, we put `heap_end_ptr` (which points to `_end`) into `dx` and add `STACK_SIZE` (the minimum stack size, 512 bytes) to it. After this, if the addition did not set the carry flag (it will not, since `dx` = `_end` + 512), we jump to label `2` (as in the previous case) and make a correct stack.
|
||||||
|
|
||||||
![stack](http://oi62.tinypic.com/dr7b5w.jpg)
|
![stack](http://oi62.tinypic.com/dr7b5w.jpg)
|
||||||
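The stack pointer selection described so far can be summarised in a short C sketch. It is not the kernel's code: the function names are made up, the parameters stand in for register and header values, `STACK_SIZE` uses the 512 bytes quoted above, and the handling of the carry case (resetting the value to zero before alignment) reflects my reading of `header.S` rather than anything shown in this excerpt.

```c
#include <stdint.h>

#define CAN_USE_HEAP (1 << 7)
#define STACK_SIZE   512      /* value quoted in the text above */

/* Label 2: round dx down to 4 bytes; fall back to 0xfffc if that gives zero. */
static uint16_t align_stack_pointer(uint16_t dx)
{
    dx &= (uint16_t)~3u;
    return dx ? dx : 0xfffc;
}

/*
 * Hypothetical helper covering the scenarios above. The parameters stand in
 * for register and header values; the function names are illustrative only.
 */
uint16_t choose_stack_pointer(uint16_t ss, uint16_t ds, uint16_t sp,
                              uint16_t end, uint16_t heap_end_ptr,
                              uint8_t loadflags)
{
    if (ss == ds)                       /* valid ss: keep the bootloader's sp */
        return align_stack_pointer(sp);

    uint16_t dx = end;                  /* invalid ss: build a new stack */
    if (loadflags & CAN_USE_HEAP)
        dx = heap_end_ptr;

    uint32_t sum = (uint32_t)dx + STACK_SIZE;
    if (sum > 0xffff)                   /* carry set: avoid wrapping around */
        sum = 0;
    return align_stack_pointer((uint16_t)sum);
}
```

For example, with a valid `ss` and `sp = 0xf7f4` the function returns `0xf7f4` unchanged, while an `sp` of zero becomes `0xfffc`. With an invalid `ss`, `CAN_USE_HEAP` set, and `heap_end_ptr = 0x5000`, the result is `0x5200`.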
|
|
||||||
@ -433,7 +433,7 @@ If the `CAN_USE_HEAP` bit is set, put `heap_end_ptr` in `dx` which points to `_e
|
|||||||
BSS Setup
|
BSS Setup
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
The last two steps that need to happen before we can jump to the main C code, are setting up the [BSS](https://en.wikipedia.org/wiki/.bss) area and checking the "magic" signature. First, signature checking:
|
The last two steps that need to happen before we can jump to the main C code are setting up the [BSS](https://en.wikipedia.org/wiki/.bss) area and checking the "magic" signature. First, signature checking:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
cmpl $0x5a5aaa55, setup_sig
|
cmpl $0x5a5aaa55, setup_sig
|
||||||
@ -444,7 +444,7 @@ This simply compares the [setup_sig](https://github.com/torvalds/linux/blob/mast
|
|||||||
|
|
||||||
If the magic number matches, knowing we have a set of correct segment registers and a stack, we only need to set up the BSS section before jumping into the C code.
|
If the magic number matches, knowing we have a set of correct segment registers and a stack, we only need to set up the BSS section before jumping into the C code.
|
||||||
|
|
||||||
The BSS section is used to store statically allocated, uninitialized data. Linux carefully ensures this area of memory is first blanked, using the following code:
|
The BSS section is used to store statically allocated, uninitialized data. Linux carefully ensures this area of memory is first zeroed using the following code:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
movw $__bss_start, %di
|
movw $__bss_start, %di
|
||||||
@ -455,14 +455,14 @@ The BSS section is used to store statically allocated, uninitialized data. Linux
|
|||||||
rep; stosl
|
rep; stosl
|
||||||
```
|
```
|
||||||
|
|
||||||
First of all the [__bss_start](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L47) address is moved into `di` and the `_end + 3` address (+3 - aligns to 4 bytes) is moved into `cx`. The `eax` register is cleared (using a `xor` instruction), and the bss section size (`cx`-`di`) is calculated and put into `cx`. Then, `cx` is divided by four (the size of a 'word'), and the `stosl` instruction is repeatedly used, storing the value of `eax` (zero) into the address pointed to by `di`, automatically increasing `di` by four (this occurs until `cx` reaches zero). The net effect of this code is that zeros are written through all words in memory from `__bss_start` to `_end`:
|
First, the [__bss_start](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L47) address is moved into `di`. Next, the `_end + 3` address (the +3 rounds the size up to a multiple of 4 bytes) is moved into `cx`. The `eax` register is cleared (using an `xor` instruction), and the bss section size (`cx`-`di`) is calculated and put into `cx`. Then, `cx` is divided by four (the size of the double word that `stosl` stores), and the `stosl` instruction is used repeatedly, storing the value of `eax` (zero) into the address pointed to by `di` and automatically increasing `di` by four, until `cx` reaches zero. The net effect of this code is that zeros are written through all words in memory from `__bss_start` to `_end`:
|
||||||
|
|
||||||
![bss](http://oi59.tinypic.com/29m2eyr.jpg)
|
![bss](http://oi59.tinypic.com/29m2eyr.jpg)
|
||||||
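The same loop can be written out in C to show the count calculation and the effect of `rep; stosl`. This is a sketch, not the kernel's implementation: `clear_bss` is a hypothetical name, and a plain byte array stands in for the segment that `di` indexes.

```c
#include <stdint.h>
#include <string.h>

/*
 * Mirrors the loop: cx = (end + 3 - start) >> 2, then cx stores of a
 * zero double word at mem[di], advancing di by 4 each time.
 * Because the count is rounded up, up to 3 bytes past `end` may be
 * written, so `mem` must have room for them.
 * Returns the number of double words written.
 */
uint16_t clear_bss(uint8_t *mem, uint16_t bss_start, uint16_t end)
{
    const uint32_t zero = 0;                       /* eax after xorl %eax, %eax */
    uint16_t di = bss_start;
    uint16_t cx = (uint16_t)(end + 3 - bss_start) >> 2;

    for (uint16_t n = cx; n != 0; n--) {           /* rep; stosl */
        memcpy(&mem[di], &zero, sizeof zero);
        di += 4;
    }
    return cx;
}
```

For a region from offset 8 to offset 18 (10 bytes), the count is `(18 + 3 - 8) >> 2 = 3`, so three double words covering offsets 8 through 19 are zeroed.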
|
|
||||||
Jump to main
|
Jump to main
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
That's all, we have the stack and BSS so we can jump to the `main()` C function:
|
That's all - we have the stack and BSS, so we can jump to the `main()` C function:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
calll main
|
calll main
|
||||||
@ -473,7 +473,7 @@ The `main()` function is located in [arch/x86/boot/main.c](https://github.com/to
|
|||||||
Conclusion
|
Conclusion
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
This is the end of the first part about Linux kernel insides. If you have questions or suggestions, ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-internals/issues/new). In the next part we will see first C code which executes in Linux kernel setup, implementation of memory routines as `memset`, `memcpy`, `earlyprintk` implementation and early console initialization and many more.
|
This is the end of the first part about Linux kernel insides. If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com), or just create an [issue](https://github.com/0xAX/linux-internals/issues/new). In the next part, we will see the first C code that executes in the Linux kernel setup, the implementation of memory routines such as `memset`, `memcpy`, `earlyprintk`, early console implementation and initialization, and much more.
|
||||||
|
|
||||||
**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-internals).**
|
**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-internals).**
|
||||||
|
|
||||||
|
@ -96,3 +96,4 @@ Thank you to all contributors:
|
|||||||
* [Connor Mullen](https://github.com/mullen3)
|
* [Connor Mullen](https://github.com/mullen3)
|
||||||
* [Alex Gonzalez](https://github.com/alex-gonz)
|
* [Alex Gonzalez](https://github.com/alex-gonz)
|
||||||
* [Tim Konick](https://github.com/tijko)
|
* [Tim Konick](https://github.com/tijko)
|
||||||
|
* [Anastas Stoyanovsky](https://github.com/anastasds)
|
||||||
|