mirror of
https://github.com/0xAX/linux-insides.git
synced 2024-12-31 19:00:58 +00:00
Merge branch '0xAX-master' into original
This commit is contained in:
commit
a8b412b056
2
.gitignore
vendored
Normal file
@ -0,0 +1,2 @@
|
||||
*.tex
|
||||
build
|
@ -1,10 +1,11 @@
|
||||
# Kernel Boot Process
|
||||
|
||||
This chapter describes the Linux kernel boot process. Here you will see a
|
||||
couple of posts which describes the full cycle of the kernel loading process:
|
||||
series of posts which describe the full cycle of the kernel loading process:
|
||||
|
||||
* [From the bootloader to kernel](linux-bootstrap-1.md) - describes all stages from turning on the computer to running the first instruction of the kernel.
|
||||
* [First steps in the kernel setup code](linux-bootstrap-2.md) - describes the first steps in the kernel setup code. You will see heap initialization and queries for various parameters such as EDD, IST, etc.
|
||||
* [Video mode initialization and transition to protected mode](linux-bootstrap-3.md) - describes video mode initialization in the kernel setup code and transition to protected mode.
|
||||
* [Transition to 64-bit mode](linux-bootstrap-4.md) - describes preparation for transition into 64-bit mode and details of transition.
|
||||
* [Kernel Decompression](linux-bootstrap-5.md) - describes preparation before kernel decompression and details of direct decompression.
|
||||
* [Kernel random address randomization](linux-bootstrap-6.md) - describes randomization of the Linux kernel load address.
|
||||
|
@ -21,7 +21,7 @@ All posts will also be accessible at [github repo](https://github.com/0xAX/linux
|
||||
|
||||
Anyway, if you are just starting to learn such tools, I will try to explain some parts during this and the following posts. Alright, this is the end of the simple introduction, and now we can start to dive into the Linux kernel and low-level stuff.
|
||||
|
||||
I've started to write this book at the time of the `3.18` Linux kernel, and many things might change from that time. If there are changes, I will update the posts accordingly.
|
||||
I've started writing this book at the time of the `3.18` Linux kernel, and many things might have changed since that time. If there are changes, I will update the posts accordingly.
|
||||
|
||||
The Magical Power Button, What happens next?
|
||||
--------------------------------------------------------------------------------
|
||||
@ -38,7 +38,7 @@ CS base 0xffff0000
|
||||
|
||||
The processor starts working in [real mode](https://en.wikipedia.org/wiki/Real_mode). Let's back up a little and try to understand [memory segmentation](https://en.wikipedia.org/wiki/Memory_segmentation) in this mode. Real mode is supported on all x86-compatible processors, from the [8086](https://en.wikipedia.org/wiki/Intel_8086) CPU all the way to the modern Intel 64-bit CPUs. The `8086` processor has a 20-bit address bus, which means that it could work with a `0-0xFFFFF` or `1 megabyte` address space. But it only has `16-bit` registers, which have a maximum address of `2^16 - 1` or `0xffff` (64 kilobytes).
|
||||
|
||||
[Memory segmentation](http://en.wikipedia.org/wiki/Memory_segmentation) is used to make use of all the address space available. All memory is divided into small, fixed-size segments of `65536` bytes (64 KB). Since we cannot address memory above `64 KB` with 16-bit registers, an alternate method was devised.
|
||||
[Memory segmentation](https://en.wikipedia.org/wiki/Memory_segmentation) is used to make use of all the address space available. All memory is divided into small, fixed-size segments of `65536` bytes (64 KB). Since we cannot address memory above `64 KB` with 16-bit registers, an alternate method was devised.
|
||||
|
||||
An address consists of two parts: a segment selector, which has a base address, and an offset from this base address. In real mode, the associated base address of a segment selector is `Segment Selector * 16`. Thus, to get a physical address in memory, we need to multiply the segment selector part by `16` and add the offset to it:
|
||||
|
||||
@ -73,29 +73,31 @@ The starting address is formed by adding the base address to the value in the EI
|
||||
'0xfffffff0'
|
||||
```
|
||||
|
||||
We get `0xfffffff0`, which is 16 bytes below 4GB. This point is called the [Reset vector](http://en.wikipedia.org/wiki/Reset_vector). This is the memory location at which the CPU expects to find the first instruction to execute after reset. It contains a [jump](http://en.wikipedia.org/wiki/JMP_%28x86_instruction%29) (`jmp`) instruction that usually points to the BIOS entry point. For example, if we look in the [coreboot](http://www.coreboot.org/) source code (`src/cpu/x86/16bit/reset16.inc`), we will see:
|
||||
We get `0xfffffff0`, which is 16 bytes below 4GB. This point is called the [Reset vector](https://en.wikipedia.org/wiki/Reset_vector). This is the memory location at which the CPU expects to find the first instruction to execute after reset. It contains a [jump](https://en.wikipedia.org/wiki/JMP_%28x86_instruction%29) (`jmp`) instruction that usually points to the BIOS entry point. For example, if we look in the [coreboot](https://www.coreboot.org/) source code (`src/cpu/x86/16bit/reset16.inc`), we will see:
|
||||
|
||||
```assembly
|
||||
.section ".reset"
|
||||
.section ".reset", "ax", %progbits
|
||||
.code16
|
||||
.globl reset_vector
|
||||
reset_vector:
|
||||
.globl _start
|
||||
_start:
|
||||
.byte 0xe9
|
||||
.int _start - ( . + 2 )
|
||||
.int _start16bit - ( . + 2 )
|
||||
...
|
||||
```
|
||||
|
||||
Here we can see the `jmp` instruction [opcode](http://ref.x86asm.net/coder32.html#xE9), which is `0xe9`, and its destination address at `_start - ( . + 2)`.
|
||||
Here we can see the `jmp` instruction [opcode](http://ref.x86asm.net/coder32.html#xE9), which is `0xe9`, and its destination address at `_start16bit - ( . + 2)`.
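To make the `- ( . + 2 )` term concrete, here is a minimal Python sketch of how a 16-bit near jump (`0xe9`) could be encoded. The function name and the example offsets are hypothetical and chosen only for illustration; they are not taken from coreboot.

```python
import struct

def encode_jmp_rel16(jmp_addr, target):
    """Encode a 16-bit near jump located at jmp_addr that lands on target.

    The displacement is relative to the address of the *next* instruction:
    one opcode byte plus a two-byte operand, i.e. jmp_addr + 3. Seen from
    the operand itself (the `.` in the listing), that is `. + 2`.
    """
    operand_addr = jmp_addr + 1                      # `.` when the operand is emitted
    disp = (target - (operand_addr + 2)) & 0xFFFF    # wrap to 16 bits
    return b"\xe9" + struct.pack("<H", disp)

# Hypothetical offsets inside the same 64 KB segment.
code = encode_jmp_rel16(0xFFF0, 0xE05B)
print(code.hex())
```

A backward jump simply produces a negative displacement, which shows up as a large unsigned 16-bit value after the wrap.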
|
||||
|
||||
We can also see that the `reset` section is `16` bytes and that is compiled to start from `0xfffffff0` address (`src/cpu/x86/16bit/reset16.lds`):
|
||||
We can also see that the `reset` section is `16` bytes long and that it is linked to start at the address `0xfffffff0` (`src/cpu/x86/16bit/reset16.ld`):
|
||||
|
||||
```
|
||||
SECTIONS {
|
||||
/* Trigger an error if I have an unuseable start address */
|
||||
_bogus = ASSERT(_start16bit >= 0xffff0000, "_start16bit too low. Please report.");
|
||||
_ROMTOP = 0xfffffff0;
|
||||
. = _ROMTOP;
|
||||
.reset . : {
|
||||
*(.reset)
|
||||
. = 15 ;
|
||||
*(.reset);
|
||||
. = 15;
|
||||
BYTE(0x00);
|
||||
}
|
||||
}
|
||||
@ -164,7 +166,7 @@ just as explained above. We have only 16-bit general purpose registers; the maxi
|
||||
|
||||
where `0x10ffef` is the last byte of a range that is `1MB + 64KB - 16` bytes long. An [8086](https://en.wikipedia.org/wiki/Intel_8086) processor (which was the first processor with real mode), in contrast, has a 20-bit address bus. Since `2^20 = 1048576` is 1MB, this means that the actual available memory is 1MB.
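The arithmetic can be checked with a short Python sketch. The helper names are made up for this example, and the masking step assumes the classic 8086 behaviour in which address bits above bit 19 are simply dropped.

```python
MB = 1 << 20
KB = 1 << 10

def linear_address(segment, offset):
    """Real-mode address before it is placed on the bus."""
    return (segment << 4) + offset

def on_20_bit_bus(address):
    """Keep only the 20 bits that a 20-bit address bus can carry."""
    return address & (MB - 1)

highest = linear_address(0xFFFF, 0xFFFF)
print(hex(highest))                  # largest segment:offset combination
print(hex(on_20_bit_bus(highest)))   # what a 20-bit bus would actually see
```

The largest segment and offset values reach past 1MB, yet a 20-bit bus wraps such an address back into the first 64 KB.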
|
||||
|
||||
General real mode's memory map is as follows:
|
||||
In general, real mode's memory map is as follows:
|
||||
|
||||
```
|
||||
0x00000000 - 0x000003FF - Real Mode Interrupt Vector Table
|
||||
@ -180,7 +182,7 @@ General real mode's memory map is as follows:
|
||||
0x000F0000 - 0x000FFFFF - System BIOS
|
||||
```
|
||||
|
||||
In the beginning of this post, I wrote that the first instruction executed by the CPU is located at address `0xFFFFFFF0`, which is much larger than `0xFFFFF` (1MB). How can the CPU access this address in real mode? The answer is in the [coreboot](http://www.coreboot.org/Developer_Manual/Memory_map) documentation:
|
||||
At the beginning of this post, I wrote that the first instruction executed by the CPU is located at address `0xFFFFFFF0`, which is much larger than `0xFFFFF` (1MB). How can the CPU access this address in real mode? The answer is in the [coreboot](https://www.coreboot.org/Developer_Manual/Memory_map) documentation:
|
||||
|
||||
```
|
||||
0xFFFE_0000 - 0xFFFF_FFFF: 128 kilobyte ROM mapped into address space
|
||||
@ -193,11 +195,11 @@ Bootloader
|
||||
|
||||
There are a number of bootloaders that can boot Linux, such as [GRUB 2](https://www.gnu.org/software/grub/) and [syslinux](http://www.syslinux.org/wiki/index.php/The_Syslinux_Project). The Linux kernel has a [Boot protocol](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/boot.txt) which specifies the requirements for a bootloader to implement Linux support. This example will describe GRUB 2.
|
||||
|
||||
Continuing from before, now that the `BIOS` has chosen a boot device and transferred control to the boot sector code, execution starts from [boot.img](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/boot/i386/pc/boot.S;hb=HEAD). This code is very simple, due to the limited amount of space available, and contains a pointer which is used to jump to the location of GRUB 2's core image. The core image begins with [diskboot.img](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/boot/i386/pc/diskboot.S;hb=HEAD), which is usually stored immediately after the first sector in the unused space before the first partition. The above code loads the rest of the core image, which contains GRUB 2's kernel and drivers for handling filesystems, into memory. After loading the rest of the core image, it executes [grub_main](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/kern/main.c) function.
|
||||
Continuing from before, now that the `BIOS` has chosen a boot device and transferred control to the boot sector code, execution starts from [boot.img](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/boot/i386/pc/boot.S;hb=HEAD). This code is very simple, due to the limited amount of space available, and contains a pointer which is used to jump to the location of GRUB 2's core image. The core image begins with [diskboot.img](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/boot/i386/pc/diskboot.S;hb=HEAD), which is usually stored immediately after the first sector in the unused space before the first partition. The above code loads the rest of the core image, which contains GRUB 2's kernel and drivers for handling filesystems, into memory. After loading the rest of the core image, it executes the [grub_main](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/kern/main.c) function.
|
||||
|
||||
The `grub_main` function initializes the console, gets the base address for modules, sets the root device, loads/parses the grub configuration file, loads modules, etc. At the end of execution, `grub_main` function moves grub to normal mode. The `grub_normal_execute` function (from the `grub-core/normal/main.c` source code file) completes the final preparations and shows a menu to select an operating system. When we select one of the grub menu entries, the `grub_menu_execute_entry` function runs, executing the grub `boot` command and booting the selected operating system.
|
||||
The `grub_main` function initializes the console, gets the base address for modules, sets the root device, loads/parses the grub configuration file, loads modules, etc. At the end of execution, the `grub_main` function moves grub to normal mode. The `grub_normal_execute` function (from the `grub-core/normal/main.c` source code file) completes the final preparations and shows a menu to select an operating system. When we select one of the grub menu entries, the `grub_menu_execute_entry` function runs, executing the grub `boot` command and booting the selected operating system.
|
||||
|
||||
As we can read in the kernel boot protocol, the bootloader must read and fill some fields of the kernel setup header, which starts at the `0x01f1` offset from the kernel setup code. You may look at the boot [linker script](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/setup.ld#L16) to make sure in this offset. The kernel header [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S) starts from:
|
||||
As we can read in the kernel boot protocol, the bootloader must read and fill some fields of the kernel setup header, which starts at the `0x01f1` offset from the kernel setup code. You may look at the boot [linker script](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/setup.ld#L16) to confirm the value of this offset. The kernel header [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S) starts from:
|
||||
|
||||
```assembly
|
||||
.globl hdr
|
||||
@ -211,9 +213,9 @@ hdr:
|
||||
boot_flag: .word 0xAA55
|
||||
```
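As a rough illustration of what a bootloader looks at, the following Python sketch reads a few setup header fields from a byte buffer. The offsets `0x1f1` and `0x200` come from the text; the positions of `boot_flag` (`0x1fe`) and of the `HdrS` magic (`0x202`) are my reading of the boot protocol, and the buffer below is synthetic rather than a real kernel image.

```python
import struct

SETUP_HEADER_OFFSET = 0x1F1   # first field: setup_sects (1 byte)
BOOT_FLAG_OFFSET = 0x1FE      # boot_flag, expected to be 0xAA55
HEADER_MAGIC_OFFSET = 0x202   # "HdrS", right after the 2-byte jump at 0x200

def read_setup_header(image):
    """Pull a few fields out of the first sectors of a kernel image."""
    if len(image) < HEADER_MAGIC_OFFSET + 4:
        raise ValueError("image too short to contain a setup header")
    setup_sects = image[SETUP_HEADER_OFFSET]
    (boot_flag,) = struct.unpack_from("<H", image, BOOT_FLAG_OFFSET)
    magic = bytes(image[HEADER_MAGIC_OFFSET:HEADER_MAGIC_OFFSET + 4])
    return {
        "setup_sects": setup_sects,
        "boot_flag": boot_flag,
        "has_hdrs_magic": magic == b"HdrS",
    }

# Synthetic 1 KB buffer standing in for the start of a real bzImage.
fake = bytearray(1024)
fake[SETUP_HEADER_OFFSET] = 4
struct.pack_into("<H", fake, BOOT_FLAG_OFFSET, 0xAA55)
fake[HEADER_MAGIC_OFFSET:HEADER_MAGIC_OFFSET + 4] = b"HdrS"
print(read_setup_header(bytes(fake)))
```

To try it on a real kernel, pass the first kilobyte of the image file instead of the synthetic buffer.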
|
||||
|
||||
The bootloader must fill this and the rest of the headers (which are only marked as being type `write` in the Linux boot protocol, such as in [this example](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/boot.txt#L354)) with values which it has either received from the command line or calculated during boot. (We will not go over full descriptions and explanations for all fields of the kernel setup header now but instead when the discuss how kernel uses them; you can find a description of all fields in the [boot protocol](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/boot.txt#L156).)
|
||||
The bootloader must fill this and the rest of the headers (which are only marked as being type `write` in the Linux boot protocol, such as in [this example](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/boot.txt#L354)) with values which it has either received from the command line or calculated during boot. (We will not go over full descriptions and explanations for all fields of the kernel setup header now, but we shall do so when we discuss how the kernel uses them; you can find a description of all fields in the [boot protocol](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/boot.txt#L156).)
|
||||
|
||||
As we can see in the kernel boot protocol, the memory map will be the following after loading the kernel:
|
||||
As we can see in the kernel boot protocol, the memory will be mapped as follows after loading the kernel:
|
||||
|
||||
```shell
|
||||
| Protected-mode kernel |
|
||||
@ -252,10 +254,10 @@ where `X` is the address of the kernel boot sector being loaded. In my case, `X`
|
||||
|
||||
The bootloader has now loaded the Linux kernel into memory, filled the header fields, and then jumped to the corresponding memory address. We can now move directly to the kernel setup code.
|
||||
|
||||
Start of Kernel Setup
|
||||
The Beginning of the Kernel Setup Stage
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Finally, we are in the kernel! Technically, the kernel hasn't run yet; first, the kernel setup part must configure some stuff like decompressor, memory management related things and etc. After all of such things, kernel setup part will decompress actual kernel and jump on it. Execution of setup part starts from [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S) at [_start](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S#L292). It is a little strange at first sight, as there are several instructions before it.
|
||||
Finally, we are in the kernel! Technically, the kernel hasn't run yet; first, the kernel setup part must configure stuff such as the decompressor and some memory management related things, to name a few. After all these things are done, the kernel setup part will decompress the actual kernel and jump to it. Execution of the setup part starts from [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S) at [_start](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S#L292). It is a little strange at first sight, as there are several instructions before it.
|
||||
|
||||
A long time ago, the Linux kernel used to have its own bootloader. Now, however, if you run, for example,
|
||||
|
||||
@ -267,7 +269,7 @@ then you will see:
|
||||
|
||||
![Try vmlinuz in qemu](http://oi60.tinypic.com/r02xkz.jpg)
|
||||
|
||||
Actually, the `header.S` starts from [MZ](https://en.wikipedia.org/wiki/DOS_MZ_executable) (see image above), the error message printing and following the [PE](https://en.wikipedia.org/wiki/Portable_Executable) header:
|
||||
Actually, the file `header.S` starts with the magic number [MZ](https://en.wikipedia.org/wiki/DOS_MZ_executable) (see image above), the error message that displays and, following that, the [PE](https://en.wikipedia.org/wiki/Portable_Executable) header:
|
||||
|
||||
```assembly
|
||||
#ifdef CONFIG_EFI_STUB
|
||||
@ -293,7 +295,7 @@ The actual kernel setup entry point is:
|
||||
_start:
|
||||
```
|
||||
|
||||
The bootloader (grub2 and others) knows about this point (`0x200` offset from `MZ`) and makes a jump directly to it, despite the fact that `header.S` starts from the `.bstext` section, which prints an error message:
|
||||
The bootloader (grub2 and others) knows about this point (at an offset of `0x200` from `MZ`) and makes a jump directly to it, despite the fact that `header.S` starts from the `.bstext` section, which prints an error message:
|
||||
|
||||
```
|
||||
//
|
||||
@ -317,9 +319,9 @@ _start:
|
||||
//
|
||||
```
|
||||
|
||||
Here we can see a `jmp` instruction opcode (`0xeb`) that jumps to the `start_of_setup-1f` point. In `Nf` notation, `2f` refers to the following local `2:` label; in our case, it is label `1` that is present right after the jump, and it contains the rest of the setup [header](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/boot.txt#L156). Right after the setup header, we see the `.entrytext` section, which starts at the `start_of_setup` label.
|
||||
Here we can see a `jmp` instruction opcode (`0xeb`) that jumps to the `start_of_setup-1f` point. In `Nf` notation, `2f`, for example, refers to the next local label `2:` that follows in the code; in our case, it is the label `1` that is present right after the jump, and it contains the rest of the setup [header](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/boot.txt#L156). Right after the setup header, we see the `.entrytext` section, which starts at the `start_of_setup` label.
|
||||
|
||||
This is the first code that actually runs (aside from the previous jump instructions, of course). After the kernel setup part received control from the bootloader, the first `jmp` instruction is located at the `0x200` offset from the start of the kernel real mode, i.e., after the first 512 bytes. This we can both read in the Linux kernel boot protocol and see in the grub2 source code:
|
||||
This is the first code that actually runs (aside from the previous jump instructions, of course). After the kernel setup part receives control from the bootloader, the first `jmp` instruction is located at the `0x200` offset from the start of the kernel's real-mode code, i.e., after the first 512 bytes. This can be seen in both the Linux kernel boot protocol and the grub2 source code:
|
||||
|
||||
```C
|
||||
segment = grub_linux_real_target >> 4;
|
||||
@ -345,10 +347,10 @@ After the jump to `start_of_setup`, the kernel needs to do the following:
|
||||
|
||||
Let's look at the implementation.
|
||||
|
||||
Segment registers align
|
||||
Aligning the Segment Registers
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
First of all, the kernel ensures that `ds` and `es` segment registers point to the same address. Next, it clears the direction flag using the `cld` instruction:
|
||||
First of all, the kernel ensures that the `ds` and `es` segment registers point to the same address. Next, it clears the direction flag using the `cld` instruction:
|
||||
|
||||
```assembly
|
||||
movw %ds, %ax
|
||||
@ -356,7 +358,7 @@ First of all, the kernel ensures that `ds` and `es` segment registers point to t
|
||||
cld
|
||||
```
|
||||
|
||||
As I wrote earlier, `grub2` loads kernel setup code at address `0x10000` by default and `cs` at `0x10200` because execution doesn't start from the start of file, but from:
|
||||
As I wrote earlier, `grub2` loads the kernel setup code at address `0x10000` by default and starts it with `cs` pointing at `0x10200`, because execution doesn't start from the start of the file, but from the jump here:
|
||||
|
||||
```assembly
|
||||
_start:
|
||||
@ -364,7 +366,7 @@ _start:
|
||||
.byte start_of_setup-1f
|
||||
```
|
||||
|
||||
jump, which is at a `512` byte offset from [4d 5a](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S#L46). It also needs to align `cs` from `0x10200` to `0x10000`, as well as all other segment registers. After that, we set up the stack:
|
||||
which is at a `512` byte offset from [4d 5a](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S#L46). We also need to align `cs` from `0x10200` to `0x10000`, as well as all other segment registers. After that, we set up the stack:
|
||||
|
||||
```assembly
|
||||
pushw %ds
|
||||
@ -372,7 +374,7 @@ jump, which is at a `512` byte offset from [4d 5a](https://github.com/torvalds/l
|
||||
lretw
|
||||
```
|
||||
|
||||
which pushes the value of `ds` to the stack with the address of the [6](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S#L494) label and executes the `lretw` instruction. When the `lretw` instruction is called, it loads the address of label `6` into the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter) register and loads `cs` with the value of `ds`. Afterward, `ds` and `cs` will have the same values.
|
||||
which pushes the value of `ds` to the stack, followed by the address of the [6](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S#L494) label, and then executes the `lretw` instruction. When the `lretw` instruction executes, it loads the address of label `6` into the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter) register and loads `cs` with the value of `ds`. Afterward, `ds` and `cs` will have the same values.
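The effect of this push/push/`lretw` sequence can be modelled with a tiny Python sketch. It assumes that the bootloader entered the setup code with `ds = 0x1000` and `cs = 0x1020` (segment values whose bases are `0x10000` and `0x10200`); the offset used for label `6` is a hypothetical placeholder.

```python
def far_return_to_label(ds, label_offset):
    """Mimic `pushw %ds; pushw $6f; lretw` and return the new (cs, ip)."""
    stack = []
    stack.append(ds)             # pushw %ds
    stack.append(label_offset)   # pushw $6f
    ip = stack.pop()             # lretw pops the offset first ...
    cs = stack.pop()             # ... and then the code segment
    return cs, ip

def segment_base(selector):
    """Real-mode base address of a segment value."""
    return selector << 4

old_cs, ds = 0x1020, 0x1000      # assumed values on entry
label_6 = 0x02A5                 # hypothetical offset of label 6
new_cs, new_ip = far_return_to_label(ds, label_6)
print(hex(segment_base(old_cs)), "->", hex(segment_base(new_cs)), hex(new_ip))
```

A far return is used because `cs` cannot be written with a plain `mov`; reloading it requires a far jump, call or return.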
|
||||
|
||||
Stack Setup
|
||||
--------------------------------------------------------------------------------
|
||||
@ -388,9 +390,9 @@ Almost all of the setup code is in preparation for the C language environment in
|
||||
|
||||
This can lead to 3 different scenarios:
|
||||
|
||||
* `ss` has valid value `0x1000` (as do all other segment registers beside `cs`)
|
||||
* `ss` is invalid and `CAN_USE_HEAP` flag is set (see below)
|
||||
* `ss` is invalid and `CAN_USE_HEAP` flag is not set (see below)
|
||||
* `ss` has a valid value `0x1000` (as do all the other segment registers besides `cs`)
|
||||
* `ss` is invalid and the `CAN_USE_HEAP` flag is set (see below)
|
||||
* `ss` is invalid and the `CAN_USE_HEAP` flag is not set (see below)
|
||||
|
||||
Let's look at all three of these scenarios in turn:
|
||||
|
||||
@ -405,7 +407,7 @@ Let's look at all three of these scenarios in turn:
|
||||
sti
|
||||
```
|
||||
|
||||
Here we can see the alignment of `dx` (contains `sp` given by bootloader) to `4` bytes and a check for whether or not it is zero. If it is zero, we put `0xfffc` (4 byte aligned address before the maximum segment size of 64 KB) in `dx`. If it is not zero, we continue to use `sp`, given by the bootloader (0xf7f4 in my case). After this, we put the `ax` value into `ss`, which stores the correct segment address of `0x1000` and sets up a correct `sp`. We now have a correct stack:
|
||||
Here we align `dx` (which contains the value of `sp` as given by the bootloader) to `4` bytes and check whether it is zero. If it is zero, we put `0xfffc` (the last 4-byte aligned address within the maximum segment size of 64 KB) in `dx`. If it is not zero, we continue to use the value of `sp` given by the bootloader (`0xf7f4` in my case). After this, we put the value of `ax`, which holds the correct segment address of `0x1000`, into `ss` and set up a correct `sp`. We now have a correct stack:
|
||||
|
||||
![stack](http://oi58.tinypic.com/16iwcis.jpg)
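The stack pointer selection described above can be written as a short Python sketch. The function name is invented for this example, and the instructions quoted in the comments reflect my reading of this part of `header.S` rather than the full listing.

```python
def choose_stack_pointer(sp_from_bootloader):
    """Return the value that ends up in sp for the 'ss is valid' case."""
    dx = sp_from_bootloader & ~3 & 0xFFFF   # andw $~3, %dx (align down to 4 bytes)
    if dx == 0:                             # the aligned value is zero
        dx = 0xFFFC                         # movw $0xfffc, %dx (top of the 64 KB segment)
    return dx

for sp in (0xF7F4, 0xF7F7, 0x0000):
    print(hex(sp), "->", hex(choose_stack_pointer(sp)))
```

Aligning downwards never moves the stack pointer outside the memory the bootloader handed over, which is presumably why the low bits are cleared rather than rounded up.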
|
||||
|
||||
@ -491,12 +493,12 @@ Links
|
||||
|
||||
* [Intel 80386 programmer's reference manual 1986](http://css.csail.mit.edu/6.858/2014/readings/i386.pdf)
|
||||
* [Minimal Boot Loader for Intel® Architecture](https://www.cs.cmu.edu/~410/doc/minimal_boot.pdf)
|
||||
* [8086](http://en.wikipedia.org/wiki/Intel_8086)
|
||||
* [80386](http://en.wikipedia.org/wiki/Intel_80386)
|
||||
* [Reset vector](http://en.wikipedia.org/wiki/Reset_vector)
|
||||
* [Real mode](http://en.wikipedia.org/wiki/Real_mode)
|
||||
* [8086](https://en.wikipedia.org/wiki/Intel_8086)
|
||||
* [80386](https://en.wikipedia.org/wiki/Intel_80386)
|
||||
* [Reset vector](https://en.wikipedia.org/wiki/Reset_vector)
|
||||
* [Real mode](https://en.wikipedia.org/wiki/Real_mode)
|
||||
* [Linux kernel boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt)
|
||||
* [CoreBoot developer manual](http://www.coreboot.org/Developer_Manual)
|
||||
* [coreboot developer manual](https://www.coreboot.org/Developer_Manual)
|
||||
* [Ralf Brown's Interrupt List](http://www.ctyme.com/intr/int.htm)
|
||||
* [Power supply](http://en.wikipedia.org/wiki/Power_supply)
|
||||
* [Power good signal](http://en.wikipedia.org/wiki/Power_good_signal)
|
||||
* [Power supply](https://en.wikipedia.org/wiki/Power_supply)
|
||||
* [Power good signal](https://en.wikipedia.org/wiki/Power_good_signal)
|
||||
|
@ -4,13 +4,13 @@ Kernel booting process. Part 2.
|
||||
First steps in the kernel setup
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
We started to dive into linux kernel insides in the previous [part](linux-bootstrap-1.md) and saw the initial part of the kernel setup code. We stopped at the first call to the `main` function (which is the first function written in C) from [arch/x86/boot/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/main.c).
|
||||
We started to dive into the Linux kernel's insides in the previous [part](linux-bootstrap-1.md) and saw the initial part of the kernel setup code. We stopped at the first call to the `main` function (which is the first function written in C) from [arch/x86/boot/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/main.c).
|
||||
|
||||
In this part, we will continue to research the kernel setup code and
|
||||
* see what `protected mode` is,
|
||||
* some preparation for the transition into it,
|
||||
* the heap and console initialization,
|
||||
* memory detection, CPU validation, keyboard initialization
|
||||
In this part, we will continue to research the kernel setup code and go over
|
||||
* what `protected mode` is,
|
||||
* the transition into it,
|
||||
* the initialization of the heap and the console,
|
||||
* memory detection, CPU validation and keyboard initialization
|
||||
* and much much more.
|
||||
|
||||
So, let's go ahead.
|
||||
@ -22,16 +22,16 @@ Before we can move to the native Intel64 [Long Mode](http://en.wikipedia.org/wik
|
||||
|
||||
What is [protected mode](https://en.wikipedia.org/wiki/Protected_mode)? Protected mode was first added to the x86 architecture in 1982 and was the main mode of Intel processors from the [80286](http://en.wikipedia.org/wiki/Intel_80286) processor until Intel 64 and long mode came.
|
||||
|
||||
The main reason to move away from [Real mode](http://wiki.osdev.org/Real_Mode) is that there is very limited access to the RAM. As you may remember from the previous part, there is only 2<sup>20</sup> bytes or 1 Megabyte, sometimes even only 640 Kilobytes of RAM available in the Real mode.
|
||||
The main reason to move away from [Real mode](http://wiki.osdev.org/Real_Mode) is that there is very limited access to the RAM. As you may remember from the previous part, there are only 2<sup>20</sup> bytes or 1 Megabyte, sometimes even only 640 Kilobytes of RAM available in the Real mode.
|
||||
|
||||
Protected mode brought many changes, but the main one is the difference in memory management. The 20-bit address bus was replaced with a 32-bit address bus. It allowed access to 4 Gigabytes of memory vs 1 Megabyte of real mode. Also, [paging](http://en.wikipedia.org/wiki/Paging) support was added, which you can read about in the next sections.
|
||||
Protected mode brought many changes, but the main one is the difference in memory management. The 20-bit address bus was replaced with a 32-bit address bus. It allowed access to 4 Gigabytes of memory vs the 1 Megabyte in real mode. Also, [paging](http://en.wikipedia.org/wiki/Paging) support was added, which you can read about in the next sections.
|
||||
|
||||
Memory management in Protected mode is divided into two, almost independent parts:
|
||||
|
||||
* Segmentation
|
||||
* Paging
|
||||
|
||||
Here we will only see segmentation. Paging will be discussed in the next sections.
|
||||
Here we will only talk about segmentation. Paging will be discussed in the next sections.
|
||||
|
||||
As you can read in the previous part, addresses consist of two parts in real mode:
|
||||
|
||||
@ -44,123 +44,123 @@ And we can get the physical address if we know these two parts by:
|
||||
PhysicalAddress = Segment Selector * 16 + Offset
|
||||
```
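As a quick check of this formula, here is a minimal Python sketch. The function name and the sample values are only illustrative.

```python
def physical_address(segment_selector, offset):
    """Real-mode translation: segment selector * 16 + offset."""
    return segment_selector * 16 + offset

print(hex(physical_address(0x2000, 0x0010)))
# Different segment:offset pairs can name the same physical byte.
print(physical_address(0x2000, 0x0010) == physical_address(0x2001, 0x0000))
```

Because segments start every 16 bytes but are 64 KB long, they overlap heavily, which is why many segment:offset pairs map to one physical address.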
|
||||
|
||||
Memory segmentation was completely redone in protected mode. There are no 64 Kilobyte fixed-size segments. Instead, the size and location of each segment is described by an associated data structure called _Segment Descriptor_. The segment descriptors are stored in a data structure called `Global Descriptor Table` (GDT).
|
||||
Memory segmentation was completely redone in protected mode. There are no 64 Kilobyte fixed-size segments. Instead, the size and location of each segment is described by an associated data structure called the _Segment Descriptor_. The segment descriptors are stored in a data structure called the `Global Descriptor Table` (GDT).
|
||||
|
||||
The GDT is a structure which resides in memory. It has no fixed place in the memory so, its address is stored in the special `GDTR` register. Later we will see the GDT loading in the Linux kernel code. There will be an operation for loading it into memory, something like:
|
||||
The GDT is a structure which resides in memory. It has no fixed place in memory, so its address is stored in the special `GDTR` register. Later we will see how the GDT is loaded in the Linux kernel code. There will be an operation for loading it, something like:
|
||||
|
||||
```assembly
|
||||
lgdt gdt
|
||||
```
|
||||
|
||||
where the `lgdt` instruction loads the base address and limit(size) of global descriptor table to the `GDTR` register. `GDTR` is a 48-bit register and consists of two parts:
|
||||
where the `lgdt` instruction loads the base address and limit (size) of the global descriptor table into the `GDTR` register. `GDTR` is a 48-bit register and consists of two parts:
|
||||
|
||||
* size(16-bit) of global descriptor table;
|
||||
* address(32-bit) of the global descriptor table.
|
||||
* the size (16-bit) of the global descriptor table;
* the address (32-bit) of the global descriptor table.
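The two parts listed above can be pictured as a 6-byte value in memory. The Python sketch below packs and unpacks such a value; it assumes the usual x86 convention that the 16-bit field holds the table size minus one and comes first, and the base address and table size used here are hypothetical.

```python
import struct

def pack_gdtr(base, table_size):
    """Build the 6-byte operand that `lgdt` reads: 16-bit limit, then 32-bit base."""
    limit = table_size - 1                   # the limit is the last valid byte offset
    return struct.pack("<HI", limit, base)

def unpack_gdtr(raw):
    """Split the 6-byte operand back into (base, table_size)."""
    limit, base = struct.unpack("<HI", raw)
    return base, limit + 1

# Hypothetical table: three 8-byte descriptors placed at 0x1000.
operand = pack_gdtr(0x00001000, 3 * 8)
print(operand.hex(), unpack_gdtr(operand))
```

Six bytes is exactly 48 bits, which matches the width of `GDTR` mentioned in the text.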
|
||||
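To make the two parts of `GDTR` concrete, here is a rough C sketch of the 6-byte operand that `lgdt` reads. The names `gdtr_image` and `make_gdtr` are hypothetical, and the `packed` attribute assumes GCC or Clang. Note one detail the list above glosses over: the 16-bit field holds the size of the table in bytes minus one.

```C
#include <stdint.h>

/* Hypothetical layout of the lgdt operand: 16-bit limit, 32-bit base. */
struct gdtr_image {
    uint16_t limit;   /* size of the GDT in bytes, minus one */
    uint32_t base;    /* address of the first descriptor     */
} __attribute__((packed));

/* Build the operand for a table of `entries` 8-byte descriptors.
 * `entries` is assumed to be at least 1. */
struct gdtr_image make_gdtr(uint32_t gdt_address, unsigned int entries)
{
    struct gdtr_image g;

    g.limit = (uint16_t)(entries * 8 - 1);
    g.base = gdt_address;
    return g;
}
```

Without the `packed` attribute the compiler would pad the structure to 8 bytes, and it would no longer match the 48 bits that the CPU expects.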
|
||||
As mentioned above the GDT contains `segment descriptors` which describe memory segments. Each descriptor is 64-bits in size. The general scheme of a descriptor is:
|
||||
|
||||
```
|
||||
63 56 51 48 45 39 32
|
||||
63 56 51 48 45 39 32
|
||||
------------------------------------------------------------
|
||||
| | |B| |A| | | | |0|E|W|A| |
|
||||
| BASE 31:24 |G|/|L|V| LIMIT |P|DPL|S| TYPE | BASE 23:16 |
|
||||
| | |D| |L| 19:16 | | | |1|C|R|A| |
|
||||
------------------------------------------------------------
|
||||
|
||||
31 16 15 0
|
||||
31 16 15 0
|
||||
------------------------------------------------------------
|
||||
| | |
|
||||
| BASE 15:0 | LIMIT 15:0 |
|
||||
| | |
|
||||
------------------------------------------------------------
|
||||
```
|
||||
|
||||
Don't worry, I know it looks a little scary after real mode, but it's easy. For example LIMIT 15:0 means that bits 0-15 of Limit are located in the beginning of the Descriptor. The rest of it is in LIMIT 19:16, which is located at bits 48-51 of the Descriptor. So, the size of Limit is 0-19 i.e 20-bits. Let's take a closer look at it:
|
||||
Don't worry, I know it looks a little scary after real mode, but it's easy. For example LIMIT 15:0 means that bits 0-15 of Limit are located at the beginning of the Descriptor. The rest of it is in LIMIT 19:16, which is located at bits 48-51 of the Descriptor. So Limit has bits 0-19, i.e. it is 20 bits wide. Let's take a closer look at it:
|
||||
|
||||
1. Limit[20-bits] is at 0-15, 48-51 bits. It defines `length_of_segment - 1`. It depends on `G`(Granularity) bit.
|
||||
1. Limit[20-bits] is split between bits 0-15 and 48-51. It defines the `length_of_segment - 1`. It depends on the `G`(Granularity) bit.
|
||||
|
||||
* if `G` (bit 55) is 0 and segment limit is 0, the size of the segment is 1 Byte
|
||||
* if `G` is 1 and segment limit is 0, the size of the segment is 4096 Bytes
|
||||
* if `G` is 0 and segment limit is 0xfffff, the size of the segment is 1 Megabyte
|
||||
* if `G` is 1 and segment limit is 0xfffff, the size of the segment is 4 Gigabytes
|
||||
* if `G` (bit 55) is 0 and the segment limit is 0, the size of the segment is 1 Byte
|
||||
* if `G` is 1 and the segment limit is 0, the size of the segment is 4096 Bytes
|
||||
* if `G` is 0 and the segment limit is 0xfffff, the size of the segment is 1 Megabyte
|
||||
* if `G` is 1 and the segment limit is 0xfffff, the size of the segment is 4 Gigabytes
|
||||
|
||||
So, it means that if
|
||||
So, what this means is
|
||||
* if G is 0, Limit is interpreted in terms of 1 Byte and the maximum size of the segment can be 1 Megabyte.
|
||||
* if G is 1, Limit is interpreted in terms of 4096 Bytes = 4 KBytes = 1 Page and the maximum size of the segment can be 4 Gigabytes. Actually, when G is 1, the value of Limit is shifted to the left by 12 bits. So, 20 bits + 12 bits = 32 bits and 2<sup>32</sup> = 4 Gigabytes.
|
||||
|
||||
2. Base[32-bits] is at 16-31, 32-39 and 56-63 bits. It defines the physical address of the segment's starting location.
|
||||
2. Base[32-bits] is split between bits 16-31, 32-39 and 56-63. It defines the physical address of the segment's starting location.
|
||||
|
||||
3. Type/Attribute[5-bits] is at 40-44 bits. It defines the type of segment and kinds of access to it.
|
||||
* `S` flag at bit 44 specifies descriptor type. If `S` is 0 then this segment is a system segment, whereas if `S` is 1 then this is a code or data segment (Stack segments are data segments which must be read/write segments).
|
||||
3. Type/Attribute[5-bits] is represented by bits 40-44. It defines the type of segment and how it can be accessed.
|
||||
* The `S` flag at bit 44 specifies the descriptor type. If `S` is 0 then this segment is a system segment, whereas if `S` is 1 then this is a code or data segment (Stack segments are data segments which must be read/write segments).
|
||||
|
||||
To determine if the segment is a code or data segment we can check its Ex(bit 43) Attribute marked as 0 in the above diagram. If it is 0, then the segment is a Data segment otherwise it is a code segment.
|
||||
To determine if the segment is a code or data segment, we can check its Ex(bit 43) Attribute (marked as 0 in the above diagram). If it is 0, then the segment is a Data segment, otherwise, it is a code segment.
|
||||
|
||||
A segment can be of one of the following types:
|
||||
|
||||
```
|
||||
| Type Field | Descriptor Type | Description
|
||||
|-----------------------------|-----------------|------------------
|
||||
| Type Field | Descriptor Type | Description |
|
||||
| --------------------------- | --------------- | ---------------------------------- |
|
||||
| Decimal | |
|
||||
| 0 E W A | |
|
||||
| 0 0 0 0 0 | Data | Read-Only
|
||||
| 1 0 0 0 1 | Data | Read-Only, accessed
|
||||
| 2 0 0 1 0 | Data | Read/Write
|
||||
| 3 0 0 1 1 | Data | Read/Write, accessed
|
||||
| 4 0 1 0 0 | Data | Read-Only, expand-down
|
||||
| 5 0 1 0 1 | Data | Read-Only, expand-down, accessed
|
||||
| 6 0 1 1 0 | Data | Read/Write, expand-down
|
||||
| 7 0 1 1 1 | Data | Read/Write, expand-down, accessed
|
||||
| C R A | |
|
||||
| 8 1 0 0 0 | Code | Execute-Only
|
||||
| 9 1 0 0 1 | Code | Execute-Only, accessed
|
||||
| 10 1 0 1 0 | Code | Execute/Read
|
||||
| 11 1 0 1 1 | Code | Execute/Read, accessed
|
||||
| 12 1 1 0 0 | Code | Execute-Only, conforming
|
||||
| 13 1 1 0 1 | Code | Execute-Only, conforming, accessed
|
||||
| 14 1 1 1 0 | Code | Execute/Read, conforming
|
||||
| 15 1 1 1 1 | Code | Execute/Read, conforming, accessed
|
||||
| 0 E W A | |
|
||||
| 0 0 0 0 0 | Data | Read-Only |
|
||||
| 1 0 0 0 1 | Data | Read-Only, accessed |
|
||||
| 2 0 0 1 0 | Data | Read/Write |
|
||||
| 3 0 0 1 1 | Data | Read/Write, accessed |
|
||||
| 4 0 1 0 0 | Data | Read-Only, expand-down |
|
||||
| 5 0 1 0 1 | Data | Read-Only, expand-down, accessed |
|
||||
| 6 0 1 1 0 | Data | Read/Write, expand-down |
|
||||
| 7 0 1 1 1 | Data | Read/Write, expand-down, accessed |
|
||||
| C R A | |
|
||||
| 8 1 0 0 0 | Code | Execute-Only |
|
||||
| 9 1 0 0 1 | Code | Execute-Only, accessed |
|
||||
| 10 1 0 1 0 | Code | Execute/Read |
|
||||
| 11 1 0 1 1 | Code | Execute/Read, accessed |
|
||||
| 12 1 1 0 0 | Code | Execute-Only, conforming |
|
||||
| 13 1 1 0 1 | Code | Execute-Only, conforming, accessed |
|
||||
| 14 1 1 1 0 | Code | Execute/Read, conforming |
|
||||
| 15 1 1 1 1 | Code | Execute/Read, conforming, accessed |
|
||||
```
|
||||
|
||||
As we can see the first bit(bit 43) is `0` for a _data_ segment and `1` for a _code_ segment. The next three bits (40, 41, 42) are either `EWA`(*E*xpansion *W*ritable *A*ccessible) or CRA(*C*onforming *R*eadable *A*ccessible).
|
||||
* if E(bit 42) is 0, expand up otherwise expand down. Read more [here](http://www.sudleyplace.com/dpmione/expanddown.html).
|
||||
* if W(bit 41)(for Data Segments) is 1, write access is allowed otherwise not. Note that read access is always allowed on data segments.
|
||||
* A(bit 40) - Whether the segment is accessed by processor or not.
|
||||
* C(bit 43) is conforming bit(for code selectors). If C is 1, the segment code can be executed from a lower level privilege e.g. user level. If C is 0, it can only be executed from the same privilege level.
|
||||
* R(bit 41)(for code segments). If 1 read access to segment is allowed otherwise not. Write access is never allowed to code segments.
|
||||
* if E(bit 42) is 0, expand up, otherwise, expand down. Read more [here](http://www.sudleyplace.com/dpmione/expanddown.html).
|
||||
* if W(bit 41)(for Data Segments) is 1, write access is allowed, and if it is 0, the segment is read-only. Note that read access is always allowed on data segments.
|
||||
* A(bit 40) indicates whether the segment has been accessed by the processor or not. The processor sets this bit when the segment is accessed.
|
||||
* C(bit 42) is the conforming bit (for code segments). If C is 1, the segment code can be executed from a lower privilege level (e.g. user level). If C is 0, it can only be executed from the same privilege level.
|
||||
* R(bit 41) controls read access to code segments; when it is 1, the segment can be read from. Write access is never granted for code segments.
|
||||
|
||||
4. DPL[2-bits] (Descriptor Privilege Level) is at bits 45-46. It defines the privilege level of the segment. It can be 0-3 where 0 is the most privileged.
|
||||
4. DPL[2-bits] (Descriptor Privilege Level) comprises the bits 45-46. It defines the privilege level of the segment. It can be 0-3 where 0 is the most privileged level.
|
||||
|
||||
5. P flag(bit 47) - indicates if the segment is present in memory or not. If P is 0, the segment will be presented as _invalid_ and the processor will refuse to read this segment.
|
||||
5. The P flag(bit 47) indicates if the segment is present in memory or not. If P is 0, the segment will be presented as _invalid_ and the processor will refuse to read from this segment.
|
||||
|
||||
6. AVL flag(bit 52) - Available and reserved bits. It is ignored in Linux.
|
||||
|
||||
7. L flag(bit 53) - indicates whether a code segment contains native 64-bit code. If 1 then the code segment executes in 64-bit mode.
|
||||
7. The L flag(bit 53) indicates whether a code segment contains native 64-bit code. If it is set, then the code segment executes in 64-bit mode.
|
||||
|
||||
8. D/B flag(bit 54) - Default/Big flag represents the operand size i.e 16/32 bits. If it is set then 32 bit otherwise 16.
|
||||
8. The D/B flag(bit 54) (Default/Big flag) represents the operand size i.e 16/32 bits. If set, operand size is 32 bits. Otherwise, it is 16 bits.
|
||||
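The eight items above can be put together in one small decoder. This is only a sketch: the structure and function names are made up for illustration, and the `size` computation assumes an ordinary expand-up segment.

```C
#include <stdint.h>

/* Illustrative names, not taken from the kernel sources. */
struct seg_info {
    uint32_t base;      /* bits 16-39 and 56-63                    */
    uint32_t limit;     /* bits 0-15 and 48-51 (20 bits)           */
    uint64_t size;      /* size in bytes after applying G          */
    unsigned type;      /* bits 40-43 (Ex, E/C, W/R, A)            */
    unsigned s;         /* bit 44: 1 = code or data, 0 = system    */
    unsigned dpl;       /* bits 45-46                              */
    unsigned present;   /* bit 47                                  */
    unsigned avl;       /* bit 52                                  */
    unsigned l;         /* bit 53                                  */
    unsigned db;        /* bit 54                                  */
    unsigned g;         /* bit 55                                  */
};

struct seg_info decode_descriptor(uint64_t d)
{
    struct seg_info i;

    i.limit   = (uint32_t)(d & 0xffff) |
                ((uint32_t)((d >> 48) & 0xf) << 16);
    i.base    = (uint32_t)((d >> 16) & 0xffffff) |
                ((uint32_t)((d >> 56) & 0xff) << 24);
    i.type    = (unsigned)((d >> 40) & 0xf);
    i.s       = (unsigned)((d >> 44) & 0x1);
    i.dpl     = (unsigned)((d >> 45) & 0x3);
    i.present = (unsigned)((d >> 47) & 0x1);
    i.avl     = (unsigned)((d >> 52) & 0x1);
    i.l       = (unsigned)((d >> 53) & 0x1);
    i.db      = (unsigned)((d >> 54) & 0x1);
    i.g       = (unsigned)((d >> 55) & 0x1);

    /* Limit is length - 1, counted in bytes (G = 0) or 4 KByte pages (G = 1). */
    i.size = i.g ? ((uint64_t)i.limit + 1) << 12
                 : (uint64_t)i.limit + 1;
    return i;
}
```

Feeding it the value `0x00cf9a000000ffff` (a flat 32-bit code segment, used here purely as an example) yields base `0`, limit `0xfffff`, `G = 1`, type `10` (Execute/Read), `DPL = 0`, `P = 1`, and a size of 4 Gigabytes.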
|
||||
Segment registers contain segment selectors as in real mode. However, in protected mode, a segment selector is handled differently. Each Segment Descriptor has an associated Segment Selector which is a 16-bit structure:
|
||||
|
||||
```
|
||||
15 3 2 1 0
|
||||
-----------------------------
|
||||
| Index | TI | RPL |
|
||||
-----------------------------
|
||||
```
|
||||
|
||||
Where,
|
||||
* **Index** shows the index number of the descriptor in the GDT.
|
||||
* **TI**(Table Indicator) shows where to search for the descriptor. If it is 0 then search in the Global Descriptor Table(GDT) otherwise it will look in Local Descriptor Table(LDT).
|
||||
* And **RPL** is Requester's Privilege Level.
|
||||
* **Index** stores the index number of the descriptor in the GDT.
|
||||
* **TI**(Table Indicator) indicates where to search for the descriptor. If it is 0 then the descriptor is searched for in the Global Descriptor Table(GDT). Otherwise, it will be searched for in the Local Descriptor Table(LDT).
|
||||
* And **RPL** contains the Requester's Privilege Level.
|
||||
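Splitting a selector into these three fields takes just a couple of shifts and masks. A small sketch, with names chosen only for this example:

```C
#include <stdint.h>

/* Illustrative structure, not a kernel type. */
struct selector_fields {
    unsigned index;  /* bits 3-15: descriptor index in the table */
    unsigned ti;     /* bit 2: 0 = GDT, 1 = LDT                  */
    unsigned rpl;    /* bits 0-1: Requester's Privilege Level    */
};

struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;

    f.rpl = sel & 0x3;
    f.ti = (sel >> 2) & 0x1;
    f.index = sel >> 3;
    return f;
}
```

So the selector `0x10` means "descriptor number 2 in the GDT, requested with privilege level 0", and `0x2b` means "descriptor number 5 in the GDT, requested with privilege level 3".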
|
||||
Every segment register has a visible and hidden part.
|
||||
* Visible - Segment Selector is stored here
|
||||
* Hidden - Segment Descriptor(base, limit, attributes, flags)
|
||||
Every segment register has a visible and a hidden part.
|
||||
* Visible - The Segment Selector is stored here.
|
||||
* Hidden - The Segment Descriptor (which contains the base, limit, attributes & flags) is stored here.
|
||||
|
||||
The following steps are needed to get the physical address in the protected mode:
|
||||
The following steps are needed to get a physical address in protected mode:
|
||||
|
||||
* The segment selector must be loaded in one of the segment registers
|
||||
* The CPU tries to find a segment descriptor by GDT address + Index from selector and load the descriptor into the *hidden* part of the segment register
|
||||
* Base address (from segment descriptor) + offset will be the linear address of the segment which is the physical address (if paging is disabled).
|
||||
* The segment selector must be loaded in one of the segment registers.
|
||||
* The CPU tries to find the segment descriptor at the address `GDT address + Index * 8` (each descriptor is 8 bytes long and `Index` is taken from the selector) and then loads the descriptor into the *hidden* part of the segment register.
|
||||
* The linear address is given by the formula: Base address (found in the descriptor obtained in the previous step) + Offset. If paging is disabled, this linear address is also the physical address.
|
||||
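The steps above can be modelled in plain C if we treat the GDT as an array of 64-bit values. This is a simplified sketch under several assumptions: the function names are invented, the LDT is not modelled, and no limit or privilege checks are performed, although a real CPU does both.

```C
#include <stddef.h>
#include <stdint.h>

/* Collect the three base fields of a descriptor (bits 16-39, 56-63). */
uint32_t descriptor_base(uint64_t d)
{
    return (uint32_t)((d >> 16) & 0xffffff) |
           ((uint32_t)((d >> 56) & 0xff) << 24);
}

/*
 * Returns 0 and stores the linear address in *linear on success,
 * or -1 if the selector points outside the table or at the LDT.
 */
int linear_address(const uint64_t *gdt, size_t entries,
                   uint16_t selector, uint32_t offset, uint32_t *linear)
{
    size_t index = selector >> 3;        /* drop TI and RPL bits */

    if ((selector & 0x4) || index >= entries)
        return -1;

    /* gdt[index] is the same as reading 8 bytes at GDT address + index * 8 */
    *linear = descriptor_base(gdt[index]) + offset;
    return 0;
}
```

With a descriptor whose base is `0x10000` stored at index 2, the selector `0x10` and the offset `0x1234` give the linear address `0x11234`.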
|
||||
Schematically it will look like this:
|
||||
|
||||
@ -169,8 +169,8 @@ Schematically it will look like this:
|
||||
The algorithm for the transition from real mode into protected mode is:
|
||||
|
||||
* Disable interrupts
|
||||
* Describe and load GDT with `lgdt` instruction
|
||||
* Set PE (Protection Enable) bit in CR0 (Control Register 0)
|
||||
* Describe and load the GDT with the `lgdt` instruction
|
||||
* Set the PE (Protection Enable) bit in CR0 (Control Register 0)
|
||||
* Jump to protected mode code
|
||||
|
||||
We will see the complete transition to protected mode in the linux kernel in the next part, but before we can move to protected mode, we need to do some more preparations.
|
||||
@ -180,15 +180,15 @@ Let's look at [arch/x86/boot/main.c](https://github.com/torvalds/linux/blob/16f7
|
||||
Copying boot parameters into the "zeropage"
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
We will start from the `main` routine in "main.c". First function which is called in `main` is [`copy_boot_params(void)`](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/main.c#L30). It copies the kernel setup header into the field of the `boot_params` structure which is defined in the [arch/x86/include/uapi/asm/bootparam.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/uapi/asm/bootparam.h#L113).
|
||||
We will start from the `main` routine in "main.c". The first function which is called in `main` is [`copy_boot_params(void)`](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/main.c#L30). It copies the kernel setup header into the corresponding field of the `boot_params` structure which is defined in the file [arch/x86/include/uapi/asm/bootparam.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/uapi/asm/bootparam.h#L113).
|
||||
|
||||
The `boot_params` structure contains the `struct setup_header hdr` field. This structure contains the same fields as defined in [linux boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt) and is filled by the boot loader and also at kernel compile/build time. `copy_boot_params` does two things:
|
||||
The `boot_params` structure contains the `struct setup_header hdr` field. This structure contains the same fields as defined in the [linux boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt) and is filled by the boot loader and also at kernel compile/build time. `copy_boot_params` does two things:
|
||||
|
||||
1. Copies `hdr` from [header.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S#L281) to the `boot_params` structure in `setup_header` field
|
||||
1. It copies `hdr` from [header.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S#L281) into the `hdr` field (of type `struct setup_header`) of the `boot_params` structure.
|
||||
|
||||
2. Updates pointer to the kernel command line if the kernel was loaded with the old command line protocol.
|
||||
2. It updates the pointer to the kernel command line if the kernel was loaded with the old command line protocol.
|
||||
|
||||
Note that it copies `hdr` with `memcpy` function which is defined in the [copy.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/copy.S) source file. Let's have a look inside:
|
||||
Note that it copies `hdr` with the `memcpy` function, defined in the [copy.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/copy.S) source file. Let's have a look inside:
|
||||
|
||||
```assembly
|
||||
GLOBAL(memcpy)
|
||||
@ -208,27 +208,27 @@ GLOBAL(memcpy)
|
||||
ENDPROC(memcpy)
|
||||
```
|
||||
|
||||
Yeah, we just moved to C code and now assembly again :) First of all, we can see that `memcpy` and other routines which are defined here, start and end with the two macros: `GLOBAL` and `ENDPROC`. `GLOBAL` is described in [arch/x86/include/asm/linkage.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/linkage.h) which defines `globl` directive and the label for it. `ENDPROC` is described in [include/linux/linkage.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/linkage.h) which marks the `name` symbol as a function name and ends with the size of the `name` symbol.
|
||||
Yeah, we just moved to C code and now assembly again :) First of all, we can see that `memcpy` and other routines which are defined here start and end with two macros: `GLOBAL` and `ENDPROC`. `GLOBAL` is described in [arch/x86/include/asm/linkage.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/linkage.h), which defines the `globl` directive and its label. `ENDPROC` is described in [include/linux/linkage.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/linkage.h); it marks the `name` symbol as a function name and ends with the size of the `name` symbol.
|
||||
|
||||
Implementation of `memcpy` is easy. At first, it pushes values from the `si` and `di` registers to the stack to preserve their values because they will change during the `memcpy`. `memcpy` (and other functions in copy.S) use `fastcall` calling conventions. So it gets its incoming parameters from the `ax`, `dx` and `cx` registers. Calling `memcpy` looks like this:
|
||||
The implementation of `memcpy` is simple. At first, it pushes values from the `si` and `di` registers to the stack to preserve their values because they will change during the `memcpy`. As we can see in the `REALMODE_CFLAGS` in `arch/x86/Makefile`, the kernel build system uses the `-mregparm=3` option of GCC, so functions get the first three parameters from `ax`, `dx` and `cx` registers. Calling `memcpy` looks like this:
|
||||
|
||||
```c
|
||||
memcpy(&boot_params.hdr, &hdr, sizeof hdr);
|
||||
```
|
||||
|
||||
So,
|
||||
* `ax` will contain the address of the `boot_params.hdr`
|
||||
* `ax` will contain the address of `boot_params.hdr`
|
||||
* `dx` will contain the address of `hdr`
|
||||
* `cx` will contain the size of `hdr` in bytes.
|
||||
|
||||
`memcpy` puts the address of `boot_params.hdr` into `di` and saves the size on the stack. After this it shifts to the right on 2 size (or divide on 4) and copies from `si` to `di` by 4 bytes. After this, we restore the size of `hdr` again, align it by 4 bytes and copy the rest of the bytes from `si` to `di` byte by byte (if there is more). Restore `si` and `di` values from the stack in the end and after this copying is finished.
|
||||
`memcpy` puts the address of `boot_params.hdr` into `di`, the address of `hdr` into `si`, and saves `cx` (the size) on the stack. After this it shifts `cx` right by 2 bits (or divides it by 4) and copies that many 4-byte chunks from the address at `si` to the address at `di`. After this, we restore the size of `hdr` again, keep only its two lowest bits (the remainder of the division by 4) and copy the rest of the bytes from the address at `si` to the address at `di` byte by byte (if there are any). Finally, the values of `si` and `di` are restored from the stack and the copying operation is finished.
|
||||
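The same two-phase idea can be written in C. This is not the kernel's code, only a sketch of the strategy described above (the name `sketch_memcpy` is made up): first copy `len / 4` chunks of four bytes, then copy the `len % 4` bytes that are left.

```C
#include <stddef.h>

void *sketch_memcpy(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t chunks = len >> 2;   /* number of whole 4-byte chunks  */
    size_t tail = len & 3;      /* 0..3 bytes left after chunks   */
    int i;

    /* Phase 1: four bytes per iteration. The assembly version does
     * this with a single string instruction; a byte loop is used
     * here to stay clear of alignment problems in portable C. */
    while (chunks--) {
        for (i = 0; i < 4; i++)
            *d++ = *s++;
    }

    /* Phase 2: the remaining bytes, one at a time. */
    while (tail--)
        *d++ = *s++;

    return dst;
}
```

For an 11-byte buffer, phase 1 copies two chunks (8 bytes) and phase 2 copies the last 3 bytes.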
|
||||
Console initialization
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
After `hdr` is copied into `boot_params.hdr`, the next step is console initialization by calling the `console_init` function which is defined in [arch/x86/boot/early_serial_console.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/early_serial_console.c).
|
||||
After `hdr` is copied into `boot_params.hdr`, the next step is to initialize the console by calling the `console_init` function, defined in [arch/x86/boot/early_serial_console.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/early_serial_console.c).
|
||||
|
||||
It tries to find the `earlyprintk` option in the command line and if the search was successful, it parses the port address and baud rate of the serial port and initializes the serial port. The value of `earlyprintk` command line option can be one of these:
|
||||
It tries to find the `earlyprintk` option in the command line and if the search was successful, it parses the port address and baud rate of the serial port and initializes the serial port. The value of the `earlyprintk` command line option can be one of these:
|
||||
|
||||
* serial,0x3f8,115200
|
||||
* serial,ttyS0,115200
|
||||
@ -258,7 +258,7 @@ void __attribute__((section(".inittext"))) putchar(int ch)
|
||||
|
||||
`__attribute__((section(".inittext")))` means that this code will be in the `.inittext` section. We can find it in the linker file [setup.ld](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/setup.ld#L19).
|
||||
|
||||
First of all, `putchar` checks for the `\n` symbol and if it is found, prints `\r` before. After that it outputs the character on the VGA screen by calling the BIOS with the `0x10` interrupt call:
|
||||
First of all, `putchar` checks for the `\n` symbol and, if it is found, prints `\r` first. After that it prints the character on the VGA screen by calling the BIOS with the `0x10` interrupt call:
|
||||
|
||||
```C
|
||||
static void __attribute__((section(".inittext"))) bios_putchar(int ch)
|
||||
@ -285,7 +285,7 @@ Here `initregs` takes the `biosregs` structure and first fills `biosregs` with z
|
||||
reg->gs = gs();
|
||||
```
|
||||
|
||||
Let's look at the [memset](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/copy.S#L36) implementation:
|
||||
Let's look at the implementation of [memset](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/copy.S#L36):
|
||||
|
||||
```assembly
|
||||
GLOBAL(memset)
|
||||
@ -304,22 +304,22 @@ GLOBAL(memset)
|
||||
ENDPROC(memset)
|
||||
```
|
||||
|
||||
As you can read above, it uses the `fastcall` calling conventions like the `memcpy` function, which means that the function gets parameters from `ax`, `dx` and `cx` registers.
|
||||
As you can read above, it uses the same calling conventions as the `memcpy` function, which means that the function gets its parameters from the `ax`, `dx` and `cx` registers.
|
||||
|
||||
Generally `memset` is like a memcpy implementation. It saves the value of the `di` register on the stack and puts the `ax` value into `di` which is the address of the `biosregs` structure. Next is the `movzbl` instruction, which copies the `dl` value to the low 2 bytes of the `eax` register. The remaining 2 high bytes of `eax` will be filled with zeros.
|
||||
The implementation of `memset` is similar to that of `memcpy`. It saves the value of the `di` register on the stack and puts the value of `ax`, which stores the address of the `biosregs` structure, into `di`. Next is the `movzbl` instruction, which copies the value of `dl` to the lowest byte of the `eax` register. The remaining 3 high bytes of `eax` will be filled with zeros.
|
||||
|
||||
The next instruction multiplies `eax` with `0x01010101`. It needs to because `memset` will copy 4 bytes at the same time. For example, we need to fill a structure with `0x7` with memset. `eax` will contain `0x00000007` value in this case. So if we multiply `eax` with `0x01010101`, we will get `0x07070707` and now we can copy these 4 bytes into the structure. `memset` uses `rep; stosl` instructions for copying `eax` into `es:di`.
|
||||
The next instruction multiplies `eax` by `0x01010101`. This is needed because `memset` will copy 4 bytes at the same time. For example, if we need to fill a structure whose size is 4 bytes with the value `0x7`, `eax` will contain `0x00000007`. If we multiply `eax` by `0x01010101`, we will get `0x07070707`, and now we can copy these 4 bytes into the structure. `memset` uses the `rep; stosl` instruction to copy `eax` into `es:di`.
|
||||
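The multiplication trick is easy to try out in C. In this sketch (the function name is made up) the byte is first zero-extended to 32 bits, so only the lowest byte of the value is non-zero, and then multiplied by `0x01010101`, which repeats that byte in all four positions:

```C
#include <stdint.h>

uint32_t broadcast_byte(uint8_t value)
{
    uint32_t eax = value;       /* zero-extension, like movzbl   */

    return eax * 0x01010101u;   /* 0x07 becomes 0x07070707       */
}
```

It works because `0x01010101` is `1 + 2^8 + 2^16 + 2^24`, so the product is the sum of the byte shifted left by 0, 8, 16 and 24 bits, and these four copies never overlap.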
|
||||
The rest of the `memset` function does almost the same as `memcpy`.
|
||||
The rest of the `memset` function does almost the same thing as `memcpy`.
|
||||
|
||||
After the `biosregs` structure is filled with `memset`, `bios_putchar` calls the [0x10](http://www.ctyme.com/intr/rb-0106.htm) interrupt which prints a character. Afterwards it checks if the serial port was initialized or not and writes a character there with [serial_putchar](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/tty.c#L30) and `inb/outb` instructions if it was set.
|
||||
|
||||
Heap initialization
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
After the stack and bss section were prepared in [header.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S) (see previous [part](linux-bootstrap-1.md)), the kernel needs to initialize the [heap](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/main.c#L116) with the [`init_heap`](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/main.c#L116) function.
|
||||
After the stack and bss section have been prepared in [header.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S) (see previous [part](linux-bootstrap-1.md)), the kernel needs to initialize the [heap](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/main.c#L116) with the [`init_heap`](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/main.c#L116) function.
|
||||
|
||||
First of all `init_heap` checks the [`CAN_USE_HEAP`](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/uapi/asm/bootparam.h#L22) flag from the [`loadflags`](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S#L321) in the kernel setup header and calculates the end of the stack if this flag was set:
|
||||
First of all `init_heap` checks the [`CAN_USE_HEAP`](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/uapi/asm/bootparam.h#L22) flag from the [`loadflags`](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S#L321) field in the kernel setup header and calculates the end of the stack if this flag was set:
|
||||
|
||||
```C
|
||||
char *stack_end;
|
||||
@ -339,12 +339,12 @@ Then there is the `heap_end` calculation:
|
||||
|
||||
which means `heap_end_ptr` or `_end` + `512` (`0x200`). The last check is whether `heap_end` is greater than `stack_end`. If it is, then `stack_end` is assigned to `heap_end` to make them equal.
|
||||
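The calculation and the final check can be summarized in a few lines of C. This sketch is not the kernel's `init_heap`; it models the addresses as plain integers and uses a made-up function name:

```C
#include <stdint.h>

/*
 * heap_end_ptr and stack_end are treated as plain numbers here.
 * The heap ends 0x200 bytes past heap_end_ptr, unless that would
 * run into the stack, in which case it ends where the stack ends.
 */
uintptr_t clamp_heap_end(uintptr_t heap_end_ptr, uintptr_t stack_end)
{
    uintptr_t heap_end = heap_end_ptr + 0x200;

    if (heap_end > stack_end)
        heap_end = stack_end;
    return heap_end;
}
```

For instance, with `heap_end_ptr = 0x9f00` and `stack_end = 0xa000`, the sum `0xa100` would overlap the stack, so the heap end becomes `0xa000`.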
|
||||
Now the heap is initialized and we can use it using the `GET_HEAP` method. We will see how it is used, how to use it and how it is implemented in the next posts.
|
||||
Now the heap is initialized and we can use it through the `GET_HEAP` macro. We will see what it is used for, how to use it and how it is implemented in the next posts.
|
||||
|
||||
CPU validation
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The next step as we can see is cpu validation by `validate_cpu` from [arch/x86/boot/cpu.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/cpu.c).
|
||||
The next step as we can see is cpu validation through the `validate_cpu` function from [arch/x86/boot/cpu.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/cpu.c).
|
||||
|
||||
It calls the [`check_cpu`](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/cpucheck.c#L112) function and passes cpu level and required cpu level to it and checks that the kernel launches on the right cpu level.
|
||||
```c
|
||||
@ -354,14 +354,14 @@ if (cpu_level < req_level) {
|
||||
return -1;
|
||||
}
|
||||
```
|
||||
`check_cpu` checks the CPU's flags, the presence of [long mode](http://en.wikipedia.org/wiki/Long_mode) in the case of x86_64(64-bit) CPU, checks the processor's vendor and makes preparation for certain vendors like turning off SSE+SSE2 for AMD if they are missing, etc.
|
||||
`check_cpu` checks the CPU's flags and the presence of [long mode](http://en.wikipedia.org/wiki/Long_mode) in the case of an x86_64 (64-bit) CPU, checks the processor's vendor and makes preparations for certain vendors, like trying to turn on SSE+SSE2 for AMD if they are missing, etc.
|
||||
|
||||
Memory detection
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The next step is memory detection by the `detect_memory` function. `detect_memory` basically provides a map of available RAM to the CPU. It uses different programming interfaces for memory detection like `0xe820`, `0xe801` and `0x88`. We will see only the implementation of **0xE820** here.
|
||||
The next step is memory detection through the `detect_memory` function. `detect_memory` basically provides a map of available RAM to the CPU. It uses different programming interfaces for memory detection like `0xe820`, `0xe801` and `0x88`. We will see only the implementation of the **0xE820** interface here.
|
||||
|
||||
Let's look into the `detect_memory_e820` implementation from the [arch/x86/boot/memory.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/memory.c) source file. First of all, the `detect_memory_e820` function initializes the `biosregs` structure as we saw above and fills registers with special values for the `0xe820` call:
|
||||
Let's look at the implementation of the `detect_memory_e820` function from the [arch/x86/boot/memory.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/memory.c) source file. First of all, the `detect_memory_e820` function initializes the `biosregs` structure as we saw above and fills registers with special values for the `0xe820` call:
|
||||
|
||||
```C
|
||||
initregs(&ireg);
|
||||
@ -372,23 +372,23 @@ Let's look into the `detect_memory_e820` implementation from the [arch/x86/boot/
|
||||
```
|
||||
|
||||
* `ax` contains the number of the function (0xe820 in our case)
|
||||
* `cx` register contains size of the buffer which will contain data about memory
|
||||
* `cx` contains the size of the buffer which will contain data about the memory
|
||||
* `edx` must contain the `SMAP` magic number
|
||||
* `es:di` must contain the address of the buffer which will contain memory data
|
||||
* `ebx` has to be zero.
|
||||
|
||||
Next is a loop where data about the memory will be collected. It starts from the call of the `0x15` BIOS interrupt, which writes one line from the address allocation table. For getting the next line we need to call this interrupt again (which we do in the loop). Before the next call `ebx` must contain the value returned previously:
|
||||
Next is a loop where data about the memory will be collected. It starts with a call to the `0x15` BIOS interrupt, which writes one line from the address allocation table. To get the next line we need to call this interrupt again (which we do in the loop). Before the next call `ebx` must contain the value returned previously:
|
||||
|
||||
```C
|
||||
intcall(0x15, &ireg, &oreg);
|
||||
ireg.ebx = oreg.ebx;
|
||||
```
|
||||
|
||||
Ultimately, it does iterations in the loop to collect data from the address allocation table and writes this data into the `e820_entry` array:
|
||||
Ultimately, this function collects data from the address allocation table and writes this data into the `e820_entry` array:
|
||||
|
||||
* start of memory segment
|
||||
* size of memory segment
|
||||
* type of memory segment (which can be reserved, usable and etc...).
|
||||
* type of memory segment (whether the particular segment is usable or reserved)
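The collection loop described above can be sketched in plain C. This is only an illustration and not the kernel's code: `fake_table`, `fake_e820_call` and `collect_e820` are hypothetical names, and the table contents are made up. The sketch assumes the BIOS convention that a continuation value of zero marks the last entry.

```C
#include <stddef.h>
#include <stdint.h>

/* One entry of the map, mirroring the three fields listed above. */
struct e820_entry {
	uint64_t addr;  /* start of memory segment */
	uint64_t size;  /* size of memory segment */
	uint32_t type;  /* 1 = usable, 2 = reserved */
};

/* Hypothetical table standing in for the data the BIOS would report. */
static const struct e820_entry fake_table[] = {
	{ 0x0000000000000000ULL, 0x000000000009fc00ULL, 1 },
	{ 0x000000000009fc00ULL, 0x0000000000000400ULL, 2 },
	{ 0x0000000000100000ULL, 0x000000003fef0000ULL, 1 },
};
#define FAKE_COUNT (sizeof(fake_table) / sizeof(fake_table[0]))

/*
 * Hypothetical stand-in for intcall(0x15, ...): copies the entry selected
 * by *ebx into buf and stores the continuation value back into *ebx
 * (0 means "that was the last entry"). Returns -1 on error.
 */
int fake_e820_call(uint32_t *ebx, struct e820_entry *buf)
{
	if (*ebx >= FAKE_COUNT)
		return -1;
	*buf = fake_table[*ebx];
	*ebx = (*ebx + 1 < FAKE_COUNT) ? *ebx + 1 : 0;
	return 0;
}

/* Collects at most max entries into out and returns how many were stored. */
size_t collect_e820(struct e820_entry *out, size_t max)
{
	uint32_t ebx = 0;  /* must be zero for the first call */
	size_t count = 0;

	if (max == 0)
		return 0;

	do {
		struct e820_entry buf;

		if (fake_e820_call(&ebx, &buf) != 0)
			break;
		out[count++] = buf;
	} while (ebx != 0 && count < max);

	return count;
}
```

Calling `collect_e820(map, 8)` with an array of eight entries fills the first three slots from the made-up table and returns 3.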
|
||||
|
||||
You can see the result of this in the `dmesg` output, something like:
|
||||
|
||||
@ -402,7 +402,7 @@ You can see the result of this in the `dmesg` output, something like:
|
||||
[ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
|
||||
```
|
||||
|
||||
Next we may see call of the `set_bios_mode` function after physical memory is detected. As we may see, this function is implemented only for the `x86_64` mode:
|
||||
Next, we may see a call to the `set_bios_mode` function. As we may see, this function is implemented only for the `x86_64` mode:
|
||||
|
||||
```C
|
||||
static void set_bios_mode(void)
|
||||
@ -418,12 +418,12 @@ static void set_bios_mode(void)
|
||||
}
|
||||
```
|
||||
|
||||
The `set_bios_mode` executes `0x15` BIOS interrupt to tell the BIOS that [long mode](https://en.wikipedia.org/wiki/Long_mode) (in a case of `bx = 2`) will be used.
|
||||
The `set_bios_mode` function executes the `0x15` BIOS interrupt to tell the BIOS that [long mode](https://en.wikipedia.org/wiki/Long_mode) (if `bx == 2`) will be used.
|
||||
|
||||
Keyboard initialization
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The next step is the initialization of the keyboard with the call of the [`keyboard_init()`](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/main.c#L65) function. At first `keyboard_init` initializes registers using the `initregs` function and calling the [0x16](http://www.ctyme.com/intr/rb-1756.htm) interrupt for getting the keyboard status.
|
||||
The next step is the initialization of the keyboard with a call to the [`keyboard_init()`](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/main.c#L65) function. First, `keyboard_init` initializes registers using the `initregs` function. It then calls the [0x16](http://www.ctyme.com/intr/rb-1756.htm) interrupt to query the status of the keyboard.
|
||||
|
||||
```c
|
||||
initregs(&ireg);
|
||||
@ -432,7 +432,7 @@ The next step is the initialization of the keyboard with the call of the [`keybo
|
||||
boot_params.kbd_status = oreg.al;
|
||||
```
|
||||
|
||||
After this it calls [0x16](http://www.ctyme.com/intr/rb-1757.htm) again to set repeat rate and delay.
|
||||
After this it calls [0x16](http://www.ctyme.com/intr/rb-1757.htm) again to set the repeat rate and delay.
|
||||
|
||||
```c
|
||||
ireg.ax = 0x0305; /* Set keyboard repeat rate */
|
||||
@ -442,15 +442,15 @@ After this it calls [0x16](http://www.ctyme.com/intr/rb-1757.htm) again to set r
|
||||
Querying
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The next couple of steps are queries for different parameters. We will not dive into details about these queries but will get back to it in later parts. Let's take a short look at these functions:
|
||||
The next couple of steps are queries for different parameters. We will not dive into details about these queries but we will get back to them in later parts. Let's take a short look at these functions:
|
||||
|
||||
The first step is getting [Intel SpeedStep](http://en.wikipedia.org/wiki/SpeedStep) information by calling the `query_ist` function. First of all, it checks the CPU level and if it is correct, calls `0x15` for getting info and saves the result to `boot_params`.
|
||||
The first step is getting [Intel SpeedStep](http://en.wikipedia.org/wiki/SpeedStep) information by calling the `query_ist` function. It checks the CPU level and if it is correct, calls `0x15` to get the info and saves the result to `boot_params`.
|
||||
|
||||
The following [query_apm_bios](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/apm.c#L21) function gets [Advanced Power Management](http://en.wikipedia.org/wiki/Advanced_Power_Management) information from the BIOS. `query_apm_bios` calls the `0x15` BIOS interruption too, but with `ah` = `0x53` to check `APM` installation. After the `0x15` execution, `query_apm_bios` functions check the `PM` signature (it must be `0x504d`), carry flag (it must be 0 if `APM` supported) and value of the `cx` register (if it's 0x02, protected mode interface is supported).
|
||||
Next, the [query_apm_bios](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/apm.c#L21) function gets [Advanced Power Management](http://en.wikipedia.org/wiki/Advanced_Power_Management) information from the BIOS. `query_apm_bios` calls the `0x15` BIOS interrupt too, but with `ah = 0x53` to check for an `APM` installation. After `0x15` finishes executing, `query_apm_bios` checks the `PM` signature (it must be `0x504d`), the carry flag (it must be 0 if `APM` is supported) and the value of the `cx` register (if it's `0x02`, the protected mode interface is supported).
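The three checks can be sketched as a small predicate. This is an illustration only: `struct apm_regs` and `apm_32bit_supported` are hypothetical names, not kernel identifiers. The sketch also assumes that the `cx` check is a test of bit 1 (`0x02`) rather than a comparison of the whole register, which as far as I know matches what `apm.c` does.

```C
#include <stdbool.h>
#include <stdint.h>

#define CARRY_FLAG 0x0001u  /* bit 0 of FLAGS (CF) */

/* Hypothetical subset of the registers returned by the BIOS call. */
struct apm_regs {
	uint16_t bx;     /* the "PM" signature is expected here */
	uint16_t cx;     /* capability flags */
	uint16_t flags;  /* CPU FLAGS after the interrupt */
};

/* Returns true if the BIOS reports APM with a 32-bit protected mode interface. */
bool apm_32bit_supported(const struct apm_regs *r)
{
	if (r->flags & CARRY_FLAG)  /* carry set: APM is not supported */
		return false;
	if (r->bx != 0x504d)        /* signature must be "PM" */
		return false;
	if (!(r->cx & 0x02))        /* bit 1: 32-bit protected mode interface */
		return false;
	return true;
}
```

With `bx = 0x504d`, `cx = 0x0003` and a clear carry flag the predicate returns true. A wrong signature or a set carry flag makes it return false.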
|
||||
|
||||
Next, it calls `0x15` again, but with `ax = 0x5304` for disconnecting the `APM` interface and connecting the 32-bit protected mode interface. In the end, it fills `boot_params.apm_bios_info` with values obtained from the BIOS.
|
||||
Next, it calls `0x15` again, first with `ax = 0x5304` to disconnect the `APM` interface and then with `ax = 0x5303` to connect the 32-bit protected mode interface. In the end, it fills `boot_params.apm_bios_info` with values obtained from the BIOS.
|
||||
|
||||
Note that `query_apm_bios` will be executed only if `CONFIG_APM` or `CONFIG_APM_MODULE` was set in the configuration file:
|
||||
Note that `query_apm_bios` will be executed only if the `CONFIG_APM` or `CONFIG_APM_MODULE` compile time flag was set in the configuration file:
|
||||
|
||||
```C
|
||||
#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
|
||||
@ -458,7 +458,7 @@ Note that `query_apm_bios` will be executed only if `CONFIG_APM` or `CONFIG_APM_
|
||||
#endif
|
||||
```
|
||||
|
||||
The last is the [`query_edd`](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/edd.c#L122) function, which queries `Enhanced Disk Drive` information from the BIOS. Let's look into the `query_edd` implementation.
|
||||
The last is the [`query_edd`](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/edd.c#L122) function, which queries `Enhanced Disk Drive` information from the BIOS. Let's look at how `query_edd` is implemented.
|
||||
|
||||
First of all, it reads the [edd](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/kernel-parameters.txt#L1023) option from the kernel's command line and if it was set to `off` then `query_edd` just returns.
|
||||
|
||||
@ -477,12 +477,12 @@ for (devno = 0x80; devno < 0x80+EDD_MBR_SIG_MAX; devno++) {
|
||||
}
|
||||
```
|
||||
|
||||
where `0x80` is the first hard drive and the value of `EDD_MBR_SIG_MAX` macro is 16. It collects data into the array of [edd_info](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/uapi/linux/edd.h#L172) structures. `get_edd_info` checks that EDD is present by invoking the `0x13` interrupt with `ah` as `0x41` and if EDD is present, `get_edd_info` again calls the `0x13` interrupt, but with `ah` as `0x48` and `si` containing the address of the buffer where EDD information will be stored.
|
||||
where `0x80` is the first hard drive and the value of the `EDD_MBR_SIG_MAX` macro is 16. It collects data into an array of [edd_info](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/uapi/linux/edd.h#L172) structures. `get_edd_info` checks that EDD is present by invoking the `0x13` interrupt with `ah` as `0x41` and if EDD is present, `get_edd_info` again calls the `0x13` interrupt, but with `ah` as `0x48` and `si` containing the address of the buffer where EDD information will be stored.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the second part about Linux kernel insides. In the next part, we will see video mode setting and the rest of preparations before the transition to protected mode and directly transitioning into it.
|
||||
This is the end of the second part about the insides of the Linux kernel. In the next part, we will see video mode setting and the rest of the preparations before the transition to protected mode, as well as the transition itself.
|
||||
|
||||
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
||||
|
||||
|
@ -4,11 +4,11 @@ Kernel booting process. Part 3.
|
||||
Video mode initialization and transition to protected mode
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the third part of the `Kernel booting process` series. In the previous [part](linux-bootstrap-2.md#kernel-booting-process-part-2), we stopped right before the call of the `set_video` routine from [main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/main.c#L181). In this part, we will see:
|
||||
This is the third part of the `Kernel booting process` series. In the previous [part](linux-bootstrap-2.md#kernel-booting-process-part-2), we stopped right before the call to the `set_video` routine from [main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/main.c#L181). In this part, we will look at:
|
||||
|
||||
* video mode initialization in the kernel setup code,
|
||||
* preparation before switching into protected mode,
|
||||
* transition to protected mode
|
||||
* the preparations made before switching into protected mode,
|
||||
* the transition to protected mode
|
||||
|
||||
**NOTE** If you don't know anything about protected mode, you can find some information about it in the previous [part](linux-bootstrap-2.md#protected-mode). Also, there are a couple of [links](linux-bootstrap-2.md#links) which can help you.
|
||||
|
||||
@ -18,7 +18,7 @@ As I wrote above, we will start from the `set_video` function which is defined i
|
||||
u16 mode = boot_params.hdr.vid_mode;
|
||||
```
|
||||
|
||||
which we filled in the `copy_boot_params` function (you can read about it in the previous post). The `vid_mode` is an obligatory field which is filled by the bootloader. You can find information about it in the kernel `boot protocol`:
|
||||
which we filled in the `copy_boot_params` function (you can read about it in the previous post). `vid_mode` is an obligatory field which is filled by the bootloader. You can find information about it in the kernel `boot protocol`:
|
||||
|
||||
```
|
||||
Offset Proto Name Meaning
|
||||
@ -38,7 +38,7 @@ vga=<mode>
|
||||
line is parsed.
|
||||
```
|
||||
|
||||
So we can add `vga` option to the grub or another bootloader configuration file and it will pass this option to the kernel command line. This option can have different values as mentioned in the description. For example, it can be an integer number `0xFFFD` or `ask`. If you pass `ask` to `vga`, you will see a menu like this:
|
||||
So we can add the `vga` option to the grub (or another bootloader's) configuration file and it will pass this option to the kernel command line. This option can have different values as mentioned in the description. For example, it can be an integer number `0xFFFD` or `ask`. If you pass `ask` to `vga`, you will see a menu like this:
|
||||
|
||||
![video mode setup menu](http://oi59.tinypic.com/ejcz81.jpg)
|
||||
|
||||
@ -65,13 +65,13 @@ After we get `vid_mode` from `boot_params.hdr` in the `set_video` function, we c
|
||||
#define RESET_HEAP() ((void *)( HEAP = _end ))
|
||||
```
|
||||
|
||||
If you have read the second part, you will remember that we initialized the heap with the [`init_heap`](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/main.c#L116) function. We have a couple of utility functions for heap which are defined in `boot.h`. They are:
|
||||
If you have read the second part, you will remember that we initialized the heap with the [`init_heap`](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/main.c#L116) function. We have a couple of utility functions for managing the heap which are defined in `boot.h`. They are:
|
||||
|
||||
```C
|
||||
#define RESET_HEAP()
|
||||
```
|
||||
|
||||
As we saw just above, it resets the heap by setting the `HEAP` variable equal to `_end`, where `_end` is just `extern char _end[];`
|
||||
As we saw just above, it resets the heap by setting the `HEAP` variable to `_end`, where `_end` is just `extern char _end[];`
|
||||
|
||||
Next is the `GET_HEAP` macro:
|
||||
|
||||
@ -82,11 +82,11 @@ Next is the `GET_HEAP` macro:
|
||||
|
||||
for heap allocation. It calls the internal function `__get_heap` with 3 parameters:
|
||||
|
||||
* size of a type in bytes, which need be allocated
|
||||
* `__alignof__(type)` shows how variables of this type are aligned
|
||||
* `n` tells how many items to allocate
|
||||
* the size in bytes of the type to be allocated
|
||||
* `__alignof__(type)` specifies how variables of this type are to be aligned
|
||||
* `n` specifies how many items to allocate
|
||||
|
||||
Implementation of `__get_heap` is:
|
||||
The implementation of `__get_heap` is:
|
||||
|
||||
```C
|
||||
static inline char *__get_heap(size_t s, size_t a, size_t n)
|
||||
@ -106,7 +106,7 @@ and we will further see its usage, something like:
|
||||
saved.data = GET_HEAP(u16, saved.x * saved.y);
|
||||
```
|
||||
|
||||
Let's try to understand how `__get_heap` works. We can see here that `HEAP` (which is equal to `_end` after `RESET_HEAP()`) is the address of aligned memory according to the `a` parameter. After this we save the memory address from `HEAP` to the `tmp` variable, move `HEAP` to the end of the allocated block and return `tmp` which is the start address of allocated memory.
|
||||
Let's try to understand how `__get_heap` works. We can see here that `HEAP` (which is equal to `_end` after `RESET_HEAP()`) is assigned the address of the aligned memory according to the `a` parameter. After this we save the memory address from `HEAP` to the `tmp` variable, move `HEAP` to the end of the allocated block and return `tmp` which is the start address of allocated memory.
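To make the align-then-bump steps concrete, here is a minimal user-space sketch of the same idea. The names `arena`, `heap_ptr`, `reset_heap` and `get_heap` are hypothetical, and the fixed 256-byte array only stands in for the memory after `_end`. Like the original, the sketch does no bounds checking, and it assumes that `a` is a power of two.

```C
#include <stddef.h>
#include <stdint.h>

/* Hypothetical backing store standing in for the memory after _end. */
static _Alignas(16) char arena[256];
static char *heap_ptr = arena;  /* plays the role of HEAP */

/* Resets the allocator to the start of the arena, like RESET_HEAP(). */
void reset_heap(void)
{
	heap_ptr = arena;
}

/* Allocates n items of size s, aligned to a (a must be a power of two). */
char *get_heap(size_t s, size_t a, size_t n)
{
	uintptr_t p = (uintptr_t)heap_ptr;
	char *tmp;

	p = (p + (a - 1)) & ~(uintptr_t)(a - 1);  /* round up to the alignment */
	tmp = (char *)p;                          /* start of the new block */
	heap_ptr = tmp + s * n;                   /* bump past the block */
	return tmp;
}
```

After `reset_heap()`, allocating three single bytes and then two 4-byte items places the second block 4 bytes after the first, because the pointer is rounded up from offset 3 to offset 4 before the block is handed out.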
|
||||
|
||||
And the last function is:
|
||||
|
||||
@ -117,7 +117,7 @@ static inline bool heap_free(size_t n)
|
||||
}
|
||||
```
|
||||
|
||||
which subtracts value of the `HEAP` from the `heap_end` (we calculated it in the previous [part](linux-bootstrap-2.md)) and returns 1 if there is enough memory for `n`.
|
||||
which subtracts the value of the `HEAP` pointer from `heap_end` (we calculated it in the previous [part](linux-bootstrap-2.md)) and returns 1 if there is enough memory available for `n` bytes.
|
||||
|
||||
That's all. Now we have a simple API for heap and can setup video mode.
|
||||
|
||||
@ -126,20 +126,20 @@ Set up video mode
|
||||
|
||||
Now we can move directly to video mode initialization. We stopped at the `RESET_HEAP()` call in the `set_video` function. Next is the call to `store_mode_params` which stores video mode parameters in the `boot_params.screen_info` structure which is defined in [include/uapi/linux/screen_info.h](https://github.com/0xAX/linux/blob/0e271fd59fe9e6d8c932309e7a42a4519c5aac6f/include/uapi/linux/screen_info.h).
|
||||
|
||||
If we look at the `store_mode_params` function, we can see that it starts with the call to the `store_cursor_position` function. As you can understand from the function name, it gets information about cursor and stores it.
|
||||
If we look at the `store_mode_params` function, we can see that it starts with a call to the `store_cursor_position` function. As you can understand from the function name, it gets information about the cursor and stores it.
|
||||
|
||||
First of all, `store_cursor_position` initializes two variables which have type `biosregs` with `AH = 0x3`, and calls `0x10` BIOS interruption. After the interruption is successfully executed, it returns row and column in the `DL` and `DH` registers. Row and column will be stored in the `orig_x` and `orig_y` fields from the `boot_params.screen_info` structure.
|
||||
First of all, `store_cursor_position` initializes two variables which have type `biosregs` with `AH = 0x3`, and calls the `0x10` BIOS interrupt. After the interrupt is successfully executed, it returns the column and row in the `DL` and `DH` registers respectively. They will be stored in the `orig_x` and `orig_y` fields of the `boot_params.screen_info` structure.
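As a small illustration of how the two values come out of one register, the sketch below splits a 16-bit `DX` value into its halves. `struct cursor_pos` and `cursor_from_dx` are hypothetical names. The mapping assumed here is the one the kernel source uses as far as I know: `DL` goes to `orig_x` (the column) and `DH` goes to `orig_y` (the row).

```C
#include <stdint.h>

/* Hypothetical pair of fields mirroring orig_x and orig_y. */
struct cursor_pos {
	uint8_t x;  /* column, like orig_x */
	uint8_t y;  /* row, like orig_y */
};

/* Splits DX as returned by INT 0x10 with AH = 0x03: DL = column, DH = row. */
struct cursor_pos cursor_from_dx(uint16_t dx)
{
	struct cursor_pos pos;

	pos.x = (uint8_t)(dx & 0xff);  /* DL is the low byte of DX */
	pos.y = (uint8_t)(dx >> 8);    /* DH is the high byte of DX */
	return pos;
}
```

For `dx = 0x1805` this gives column 5 and row 24.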
|
||||
|
||||
After `store_cursor_position` is executed, the `store_video_mode` function will be called. It just gets the current video mode and stores it in `boot_params.screen_info.orig_video_mode`.
|
||||
|
||||
After this, the `store_mode_params` checks the current video mode and sets the `video_segment`. After the BIOS transfers control to the boot sector, the following addresses are for video memory:
|
||||
After this, `store_mode_params` checks the current video mode and sets the `video_segment`. After the BIOS transfers control to the boot sector, the following addresses are for video memory:
|
||||
|
||||
```
|
||||
0xB000:0x0000 32 Kb Monochrome Text Video Memory
|
||||
0xB800:0x0000 32 Kb Color Text Video Memory
|
||||
```
|
||||
|
||||
So we set the `video_segment` variable to `0xb000` if the current video mode is MDA, HGC, or VGA in monochrome mode and to `0xb800` if the current video mode is in color mode. After setting up the address of the video segment, font size needs to be stored in `boot_params.screen_info.orig_video_points` with:
|
||||
So we set the `video_segment` variable to `0xb000` if the current video mode is MDA, HGC, or VGA in monochrome mode and to `0xb800` if the current video mode is in color mode. After setting up the address of the video segment, the font size needs to be stored in `boot_params.screen_info.orig_video_points` with:
|
||||
|
||||
```C
|
||||
set_fs(0);
|
||||
@ -156,7 +156,7 @@ y = (adapter == ADAPTER_CGA) ? 25 : rdfs8(0x484)+1;
|
||||
|
||||
Next, we get the amount of columns by address `0x44a` and rows by address `0x484` and store them in `boot_params.screen_info.orig_video_cols` and `boot_params.screen_info.orig_video_lines`. After this, execution of `store_mode_params` is finished.
|
||||
|
||||
Next we can see the `save_screen` function which just saves screen content to the heap. This function collects all data which we got in the previous functions like rows and columns amount etc. and stores it in the `saved_screen` structure, which is defined as:
|
||||
Next we can see the `save_screen` function which just saves the contents of the screen to the heap. This function collects all the data which we got in the previous functions (such as the number of rows and columns) and stores it in the `saved_screen` structure, which is defined as:
|
||||
|
||||
```C
|
||||
static struct saved_screen {
|
||||
@ -175,7 +175,7 @@ if (!heap_free(saved.x*saved.y*sizeof(u16)+512))
|
||||
|
||||
and allocates space in the heap if it is enough and stores `saved_screen` in it.
|
||||
|
||||
The next call is `probe_cards(0)` from [arch/x86/boot/video-mode.c](https://github.com/0xAX/linux/blob/0e271fd59fe9e6d8c932309e7a42a4519c5aac6f/arch/x86/boot/video-mode.c#L33). It goes over all video_cards and collects the number of modes provided by the cards. Here is the interesting moment, we can see the loop:
|
||||
The next call is `probe_cards(0)` from [arch/x86/boot/video-mode.c](https://github.com/0xAX/linux/blob/0e271fd59fe9e6d8c932309e7a42a4519c5aac6f/arch/x86/boot/video-mode.c#L33). It goes over all video_cards and collects the number of modes provided by the cards. Here is the interesting part, we can see the loop:
|
||||
|
||||
```C
|
||||
for (card = video_cards; card < video_cards_end; card++) {
|
||||
@ -183,7 +183,7 @@ for (card = video_cards; card < video_cards_end; card++) {
|
||||
}
|
||||
```
|
||||
|
||||
but `video_cards` is not declared anywhere. The answer is simple: every video mode presented in the x86 kernel setup code has definition like this:
|
||||
but `video_cards` is not declared anywhere. The answer is simple: every video mode presented in the x86 kernel setup code has a definition that looks like this:
|
||||
|
||||
```C
|
||||
static __videocard video_vga = {
|
||||
@ -199,7 +199,7 @@ where `__videocard` is a macro:
|
||||
#define __videocard struct card_info __attribute__((used,section(".videocards")))
|
||||
```
|
||||
|
||||
which means that `card_info` structure:
|
||||
which means that the `card_info` structure:
|
||||
|
||||
```C
|
||||
struct card_info {
|
||||
@ -224,13 +224,13 @@ is in the `.videocards` segment. Let's look in the [arch/x86/boot/setup.ld](http
|
||||
}
|
||||
```
|
||||
|
||||
It means that `video_cards` is just a memory address and all `card_info` structures are placed in this segment. It means that all `card_info` structures are placed between `video_cards` and `video_cards_end`, so we can use it in a loop to go over all of it. After `probe_cards` executes we have all structures like `static __videocard video_vga` with filled `nmodes` (number of video modes).
|
||||
This means that `video_cards` is just a memory address and that all `card_info` structures are placed in this section, between `video_cards` and `video_cards_end`, so we can use a loop to go over all of them. After `probe_cards` executes, we have a set of structures like `static __videocard video_vga` with `nmodes` (the number of video modes) filled in.
|
||||
|
||||
After `probe_cards` execution is finished, we move to the main loop in the `set_video` function. There is an infinite loop which tries to set up video mode with the `set_mode` function or prints a menu if we passed `vid_mode=ask` to the kernel command line or video mode is undefined.
|
||||
After the `probe_cards` function is done, we move to the main loop in the `set_video` function. There is an infinite loop which tries to set up the video mode with the `set_mode` function or prints a menu if we passed `vga=ask` on the kernel command line or if the video mode is undefined.
|
||||
|
||||
The `set_mode` function is defined in [video-mode.c](https://github.com/0xAX/linux/blob/0a07b238e5f488b459b6113a62e06b6aab017f71/arch/x86/boot/video-mode.c#L147) and gets only one parameter, `mode`, which is the number of video modes (we got it from the menu or in the start of `setup_video`, from the kernel setup header).
|
||||
The `set_mode` function is defined in [video-mode.c](https://github.com/0xAX/linux/blob/0a07b238e5f488b459b6113a62e06b6aab017f71/arch/x86/boot/video-mode.c#L147) and takes only one parameter, `mode`, which is the number of the video mode (we got this value from the menu or at the start of `set_video`, from the kernel setup header).
|
||||
|
||||
The `set_mode` function checks the `mode` and calls the `raw_set_mode` function. The `raw_set_mode` calls the `set_mode` function for the selected card i.e. `card->set_mode(struct mode_info*)`. We can get access to this function from the `card_info` structure. Every video mode defines this structure with values filled depending upon the video mode (for example for `vga` it is the `video_vga.set_mode` function. See above example of `card_info` structure for `vga`). `video_vga.set_mode` is `vga_set_mode`, which checks the vga mode and calls the respective function:
|
||||
The `set_mode` function checks the `mode` and calls the `raw_set_mode` function. The `raw_set_mode` calls the selected card's `set_mode` function, i.e. `card->set_mode(struct mode_info*)`. We can get access to this function from the `card_info` structure. Every video mode defines this structure with values filled depending upon the video mode (for example for `vga` it is the `video_vga.set_mode` function. See the above example of the `card_info` structure for `vga`). `video_vga.set_mode` is `vga_set_mode`, which checks the vga mode and calls the respective function:
|
||||
|
||||
```C
|
||||
static int vga_set_mode(struct mode_info *mode)
|
||||
@ -268,22 +268,22 @@ static int vga_set_mode(struct mode_info *mode)
|
||||
|
||||
Every function which sets up video mode just calls the `0x10` BIOS interrupt with a certain value in the `AH` register.
|
||||
|
||||
After we have set video mode, we pass it to `boot_params.hdr.vid_mode`.
|
||||
After we have set the video mode, we pass it to `boot_params.hdr.vid_mode`.
|
||||
|
||||
Next `vesa_store_edid` is called. This function simply stores the [EDID](https://en.wikipedia.org/wiki/Extended_Display_Identification_Data) (**E**xtended **D**isplay **I**dentification **D**ata) information for kernel use. After this `store_mode_params` is called again. Lastly, if `do_restore` is set, the screen is restored to an earlier state.
|
||||
Next, `vesa_store_edid` is called. This function simply stores the [EDID](https://en.wikipedia.org/wiki/Extended_Display_Identification_Data) (**E**xtended **D**isplay **I**dentification **D**ata) information for kernel use. After this `store_mode_params` is called again. Lastly, if `do_restore` is set, the screen is restored to an earlier state.
|
||||
|
||||
After this, we have set video mode and now we can switch to the protected mode.
|
||||
Having done this, the video mode setup is complete and now we can switch to the protected mode.
|
||||
|
||||
Last preparation before transition into protected mode
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
We can see the last function call - `go_to_protected_mode` - in [main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/main.c#L184). As the comment says: `Do the last things and invoke protected mode`, so let's see these last things and switch into protected mode.
|
||||
We can see the last function call - `go_to_protected_mode` - in [main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/main.c#L184). As the comment says: `Do the last things and invoke protected mode`, so let's see what these last things are and switch into protected mode.
|
||||
|
||||
The `go_to_protected_mode` is defined in [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/pm.c#L104). It contains some functions which make the last preparations before we can jump into protected mode, so let's look at it and try to understand what they do and how it works.
|
||||
The `go_to_protected_mode` function is defined in [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/pm.c#L104). It contains some functions which make the last preparations before we can jump into protected mode, so let's look at it and try to understand what it does and how it works.
|
||||
|
||||
First is the call to the `realmode_switch_hook` function in `go_to_protected_mode`. This function invokes the real mode switch hook if it is present and disables [NMI](http://en.wikipedia.org/wiki/Non-maskable_interrupt). Hooks are used if the bootloader runs in a hostile environment. You can read more about hooks in the [boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt) (see **ADVANCED BOOT LOADER HOOKS**).
|
||||
|
||||
The `realmode_switch` hook presents a pointer to the 16-bit real mode far subroutine which disables non-maskable interrupts. After `realmode_switch` hook (it isn't present for me) is checked, disabling of Non-Maskable Interrupts(NMI) occurs:
|
||||
The `realmode_switch` hook is a pointer to a 16-bit real mode far subroutine which disables non-maskable interrupts. After the `realmode_switch` hook (it isn't present for me) is checked, Non-Maskable Interrupts (NMI) are disabled:
|
||||
|
||||
```C
|
||||
asm volatile("cli");
|
||||
@ -291,11 +291,11 @@ outb(0x80, 0x70); /* Disable NMI */
|
||||
io_delay();
|
||||
```
|
||||
|
||||
At first, there is an inline assembly instruction with a `cli` instruction which clears the interrupt flag (`IF`). After this, external interrupts are disabled. The next line disables NMI (non-maskable interrupt).
|
||||
At first, there is an inline assembly statement with a `cli` instruction which clears the interrupt flag (`IF`). After this, external interrupts are disabled. The next line disables NMI (non-maskable interrupt).
|
||||
|
||||
An interrupt is a signal to the CPU which is emitted by hardware or software. After getting the signal, the CPU suspends the current instruction sequence, saves its state and transfers control to the interrupt handler. After the interrupt handler has finished it's work, it transfers control to the interrupted instruction. Non-maskable interrupts (NMI) are interrupts which are always processed, independently of permission. It cannot be ignored and is typically used to signal for non-recoverable hardware errors. We will not dive into details of interrupts now but will discuss it in the next posts.
|
||||
An interrupt is a signal to the CPU which is emitted by hardware or software. After getting such a signal, the CPU suspends the current instruction sequence, saves its state and transfers control to the interrupt handler. After the interrupt handler has finished its work, it transfers control back to the interrupted instruction. Non-maskable interrupts (NMI) are interrupts which are always processed, independently of permission. They cannot be ignored and are typically used to signal non-recoverable hardware errors. We will not dive into the details of interrupts now but we will be discussing them in the coming posts.
|
||||
|
||||
Let's get back to the code. We can see that second line is writing `0x80` (disabled bit) byte to `0x70` (CMOS Address register). After that, a call to the `io_delay` function occurs. `io_delay` causes a small delay and looks like:
|
||||
Let's get back to the code. We can see in the second line that we are writing the byte `0x80` (disabled bit) to `0x70` (the CMOS Address register). After that, a call to the `io_delay` function occurs. `io_delay` causes a small delay and looks like:
|
||||
|
||||
```C
|
||||
static inline void io_delay(void)
|
||||
@ -305,9 +305,9 @@ static inline void io_delay(void)
|
||||
}
|
||||
```
|
||||
|
||||
To output any byte to the port `0x80` should delay exactly 1 microsecond. So we can write any value (value from `AL` register in our case) to the `0x80` port. After this delay `realmode_switch_hook` function has finished execution and we can move to the next function.
|
||||
Writing any byte to port `0x80` should cause a delay of about 1 microsecond. So we can write any value (the value from `AL` in our case) to the `0x80` port. After this delay, the `realmode_switch_hook` function has finished execution and we can move to the next function.
|
||||
|
||||
The next function is `enable_a20`, which enables [A20 line](http://en.wikipedia.org/wiki/A20_line). This function is defined in [arch/x86/boot/a20.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/a20.c) and it tries to enable the A20 gate with different methods. The first is the `a20_test_short` function which checks if A20 is already enabled or not with the `a20_test` function:
|
||||
The next function is `enable_a20`, which enables the [A20 line](http://en.wikipedia.org/wiki/A20_line). This function is defined in [arch/x86/boot/a20.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/a20.c) and it tries to enable the A20 gate with different methods. The first is the `a20_test_short` function which checks if A20 is already enabled or not with the `a20_test` function:
|
||||
|
||||
```C
|
||||
static int a20_test(int loops)
|
||||
@ -333,11 +333,13 @@ static int a20_test(int loops)
|
||||
}
|
||||
```
|
||||
|
||||
First of all, we put `0x0000` in the `FS` register and `0xffff` in the `GS` register. Next, we read the value in address `A20_TEST_ADDR` (it is `0x200`) and put this value into the `saved` variable and `ctr`.
|
||||
First of all, we put `0x0000` in the `FS` register and `0xffff` in the `GS` register. Next, we read the value at the address `A20_TEST_ADDR` (it is `0x200`) and put this value into the variables `saved` and `ctr`.
|
||||
|
||||
Next, we write an updated `ctr` value into `fs:gs` with the `wrfs32` function, then delay for 1ms, and then read the value from the `GS` register by address `A20_TEST_ADDR+0x10`, if it's not zero we already have enabled the A20 line. If A20 is disabled, we try to enable it with a different method which you can find in the `a20.c`. For example with call of `0x15` BIOS interrupt with `AH=0x2041` etc.
|
||||
Next, we write an updated `ctr` value to `fs:A20_TEST_ADDR` (that is, `fs:0x200`) with the `wrfs32` function, delay for 1ms, and then read the value at `gs:A20_TEST_ADDR+0x10`. If the A20 line is disabled, the second address wraps around and overlaps the first one, so both locations hold the same value. If the value read through `gs` differs from `ctr`, the A20 line is already enabled.
|
||||
|
||||
If the `enabled_a20` function finished with fail, print an error message and call function `die`. You can remember it from the first source code file where we started - [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S):
|
||||
If A20 is disabled, we try to enable it with a different method which you can find in `a20.c`. For example, it can be done with a call to the `0x15` BIOS interrupt with `AH=0x2041`.
|
||||
|
||||
If the `enable_a20` function fails, an error message is printed and the `die` function is called. You may remember it from the first source code file where we started - [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S):
|
||||
|
||||
```assembly
|
||||
die:
|
||||
@ -366,10 +368,10 @@ This masks all interrupts on the secondary PIC (Programmable Interrupt Controlle
|
||||
|
||||
And after all of these preparations, we can see the actual transition into protected mode.
|
||||
|
||||
Set up Interrupt Descriptor Table
|
||||
Set up the Interrupt Descriptor Table
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Now we set up the Interrupt Descriptor table (IDT). `setup_idt`:
|
||||
Now we set up the Interrupt Descriptor table (IDT) in the `setup_idt` function:
|
||||
|
||||
```C
|
||||
static void setup_idt(void)
|
||||
@ -379,7 +381,7 @@ static void setup_idt(void)
|
||||
}
|
||||
```
|
||||
|
||||
which sets up the Interrupt Descriptor Table (describes interrupt handlers and etc.). For now, the IDT is not installed (we will see it later), but now we just the load IDT with the `lidtl` instruction. `null_idt` contains address and size of IDT, but now they are just zero. `null_idt` is a `gdt_ptr` structure, it as defined as:
|
||||
which sets up the Interrupt Descriptor Table (it describes interrupt handlers and so on). For now, the IDT is not installed (we will see it later); we just load the IDT with the `lidtl` instruction. `null_idt` contains the address and size of the IDT, but for now they are just zero. `null_idt` is a `gdt_ptr` structure, which is defined as:
|
||||
|
||||
```C
|
||||
struct gdt_ptr {
|
||||
@ -393,7 +395,7 @@ where we can see the 16-bit length(`len`) of the IDT and the 32-bit pointer to i
|
||||
Set up Global Descriptor Table
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Next is the setup of the Global Descriptor Table (GDT). We can see the `setup_gdt` function which sets up GDT (you can read about it in the [Kernel booting process. Part 2.](linux-bootstrap-2.md#protected-mode)). There is a definition of the `boot_gdt` array in this function, which contains the definition of the three segments:
|
||||
Next is the setup of the Global Descriptor Table (GDT). We can see the `setup_gdt` function which sets up the GDT (you can read about it in the post [Kernel booting process. Part 2.](linux-bootstrap-2.md#protected-mode)). There is a definition of the `boot_gdt` array in this function, which contains the definition of the three segments:
|
||||
|
||||
```C
|
||||
static const u64 boot_gdt[] __attribute__((aligned(16))) = {
|
||||
@ -403,7 +405,7 @@ static const u64 boot_gdt[] __attribute__((aligned(16))) = {
|
||||
};
|
||||
```
|
||||
|
||||
for code, data and TSS (Task State Segment). We will not use the task state segment for now, it was added there to make Intel VT happy as we can see in the comment line (if you're interested you can find commit which describes it - [here](https://github.com/torvalds/linux/commit/88089519f302f1296b4739be45699f06f728ec31)). Let's look at `boot_gdt`. First of all note that it has the `__attribute__((aligned(16)))` attribute. It means that this structure will be aligned by 16 bytes.
|
||||
for code, data and TSS (Task State Segment). We will not use the task state segment for now, it was added there to make Intel VT happy as we can see in the comment line (if you're interested you can find the commit which describes it - [here](https://github.com/torvalds/linux/commit/88089519f302f1296b4739be45699f06f728ec31)). Let's look at `boot_gdt`. First of all note that it has the `__attribute__((aligned(16)))` attribute. It means that this structure will be aligned by 16 bytes.
|
||||
|
||||
Let's look at a simple example:
|
||||
|
||||
@ -430,7 +432,7 @@ int main(void)
|
||||
}
|
||||
```
|
||||
|
||||
Technically a structure which contains one `int` field must be 4 bytes, but here `aligned` structure will be 16 bytes:
|
||||
Technically a structure which contains one `int` field must be 4 bytes in size, but an `aligned` structure will need 16 bytes to store in memory:
|
||||
|
||||
```
|
||||
$ gcc test.c -o test && ./test
|
||||
@ -438,9 +440,9 @@ Not aligned - 4
|
||||
Aligned - 16
|
||||
```
|
||||
|
||||
The `GDT_ENTRY_BOOT_CS` has index - 2 here, `GDT_ENTRY_BOOT_DS` is `GDT_ENTRY_BOOT_CS + 1` and etc. It starts from 2, because first is a mandatory null descriptor (index - 0) and the second is not used (index - 1).
|
||||
`GDT_ENTRY_BOOT_CS` has index 2 here, `GDT_ENTRY_BOOT_DS` is `GDT_ENTRY_BOOT_CS + 1`, and so on. It starts from 2 because the first entry is a mandatory null descriptor (index 0) and the second is not used (index 1).
|
||||
|
||||
The `GDT_ENTRY` is a macro which takes flags, base, limit and builds GDT entry. For example, let's look at the code segment entry. `GDT_ENTRY` takes following values:
|
||||
`GDT_ENTRY` is a macro which takes flags, base, limit and builds a GDT entry. For example, let's look at the code segment entry. `GDT_ENTRY` takes the following values:
|
||||
|
||||
* base - 0
|
||||
* limit - 0xfffff
|
||||
@ -481,7 +483,7 @@ Next we get a pointer to the GDT with:
|
||||
gdt.ptr = (u32)&boot_gdt + (ds() << 4);
|
||||
```
|
||||
|
||||
Here we just get the address of `boot_gdt` and add it to the address of the data segment left-shifted by 4 bits (remember we're in the real mode now).
|
||||
Here we just get the address of `boot_gdt` and add it to the address of the data segment left-shifted by 4 bits (remember we're in real mode now).
|
||||
|
||||
Lastly we execute the `lgdtl` instruction to load the GDT into the GDTR register:
|
||||
|
||||
@ -492,7 +494,7 @@ asm volatile("lgdtl %0" : : "m" (gdt));
|
||||
Actual transition into protected mode
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the `go_to_protected_mode` function. We loaded IDT, GDT, disable interruptions and now can switch the CPU into protected mode. The last step is calling the `protected_mode_jump` function with two parameters:
|
||||
This is the end of the `go_to_protected_mode` function. We loaded the IDT and GDT, disabled interrupts and now can switch the CPU into protected mode. The last step is calling the `protected_mode_jump` function with two parameters:
|
||||
|
||||
```C
|
||||
protected_mode_jump(boot_params.hdr.code32_start, (u32)&boot_params + (ds() << 4));
|
||||
@ -502,12 +504,12 @@ which is defined in [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/b
|
||||
|
||||
It takes two parameters:
|
||||
|
||||
* address of protected mode entry point
|
||||
* address of the protected mode entry point
|
||||
* address of `boot_params`
|
||||
|
||||
Let's look inside `protected_mode_jump`. As I wrote above, you can find it in `arch/x86/boot/pmjump.S`. The first parameter will be in the `eax` register and the second one is in `edx`.
|
||||
|
||||
First of all, we put the address of `boot_params` in the `esi` register and the address of code segment register `cs` (0x1000) in `bx`. After this, we shift `bx` by 4 bits and add it to the memory location labeled `2` (which is `bx << 4 + in_pm32`, the physical address to jump after transitioned to 32-bit mode) and jump to label `1`. Next we put data segment and task state segment in the `cx` and `di` registers with:
|
||||
First of all, we put the address of `boot_params` in the `esi` register and the value of the code segment register `cs` (0x1000) in `bx`. After this, we shift `bx` left by 4 bits and add it to the memory location labeled `2` (which then holds `(cs << 4) + in_pm32`, the physical address to jump to after the transition to 32-bit mode) and jump to label `1`. Next we put the data segment and the task state segment in the `cx` and `di` registers with:
|
||||
|
||||
```assembly
|
||||
movw $__BOOT_DS, %cx
|
||||
@ -537,16 +539,16 @@ where:
|
||||
* `0x66` is the operand-size prefix which allows us to mix 16-bit and 32-bit code
|
||||
* `0xea` - is the jump opcode
|
||||
* `in_pm32` is the segment offset
|
||||
* `__BOOT_CS` is the code segment
|
||||
* `__BOOT_CS` is the code segment we want to jump to.
|
||||
|
||||
After this we are finally in the protected mode:
|
||||
After this we are finally in protected mode:
|
||||
|
||||
```assembly
|
||||
.code32
|
||||
.section ".text32","ax"
|
||||
```
|
||||
|
||||
Let's look at the first steps in protected mode. First of all we set up the data segment with:
|
||||
Let's look at the first steps taken in protected mode. First of all we set up the data segment with:
|
||||
|
||||
```assembly
|
||||
movl %ecx, %ds
|
||||
@ -558,13 +560,13 @@ movl %ecx, %ss
|
||||
|
||||
If you paid attention, you may remember that we saved `$__BOOT_DS` in the `cx` register. Now we load it into all segment registers besides `cs` (`cs` is already `__BOOT_CS`).
|
||||
|
||||
And setup valid stack for debugging purposes:
|
||||
And set up a valid stack for debugging purposes:
|
||||
|
||||
```assembly
|
||||
addl %ebx, %esp
|
||||
```
|
||||
|
||||
The last step before jump into 32-bit entry point is clearing of general purpose registers:
|
||||
The last step before the jump into the 32-bit entry point is to clear the general purpose registers:
|
||||
|
||||
```assembly
|
||||
xorl %ecx, %ecx
|
||||
@ -582,12 +584,12 @@ jmpl *%eax
|
||||
|
||||
Remember that `eax` contains the address of the 32-bit entry (we passed it as the first parameter into `protected_mode_jump`).
|
||||
|
||||
That's all. We're in the protected mode and stop at it's entry point. We will see what happens next in the next part.
|
||||
That's all. We're in protected mode and stop at its entry point. We will see what happens next in the next part.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the third part about linux kernel insides. In next part, we will see first steps in the protected mode and transition into the [long mode](http://en.wikipedia.org/wiki/Long_mode).
|
||||
This is the end of the third part about linux kernel insides. In the next part, we will look at the first steps we take in protected mode and transition into [long mode](http://en.wikipedia.org/wiki/Long_mode).
|
||||
|
||||
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
||||
|
||||
|
@ -151,7 +151,7 @@ Bit 6 (write): KEEP_SEGMENTS
|
||||
So, if the `KEEP_SEGMENTS` bit is not set in the `loadflags`, we need to set `ds`, `ss` and `es` segment registers to the index of data segment with base `0`. That we do:
|
||||
|
||||
```C
|
||||
testb $(1 << 6), BP_loadflags(%esi)
|
||||
testb $KEEP_SEGMENTS, BP_loadflags(%esi)
|
||||
jnz 1f
|
||||
|
||||
cli
|
||||
@ -337,8 +337,6 @@ When we are using position-independent code an address is obtained by adding the
|
||||
jge 1f
|
||||
#endif
|
||||
movl $LOAD_PHYSICAL_ADDR, %ebx
|
||||
1:
|
||||
addl $z_extract_offset, %ebx
|
||||
```
|
||||
|
||||
Remember that the value of the `ebp` register is the physical address of the `startup_32` label. If the `CONFIG_RELOCATABLE` kernel configuration option is enabled during kernel configuration, we put this address in the `ebx` register, align it to a multiple of `2MB` and compare it with the `LOAD_PHYSICAL_ADDR` value. The `LOAD_PHYSICAL_ADDR` macro is defined in the [arch/x86/include/asm/boot.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/boot.h) header file and it looks like this:
|
||||
@ -354,9 +352,10 @@ As we can see it just expands to the aligned `CONFIG_PHYSICAL_ALIGN` value which
|
||||
After all of these calculations, we will have `ebp` which contains the address where we loaded it and `ebx` set to the address of where kernel will be moved after decompression. But that is not the end. The compressed kernel image should be moved to the end of the decompression buffer to simplify calculations where kernel will be located later. For this:
|
||||
|
||||
```assembly
|
||||
movl BP_init_size(%esi), %eax
|
||||
subl $_end, %eax
|
||||
addl %eax, %ebx
|
||||
1:
|
||||
movl BP_init_size(%esi), %eax
|
||||
subl $_end, %eax
|
||||
addl %eax, %ebx
|
||||
```
|
||||
|
||||
we put the value of `boot_params.BP_init_size` (the `hdr.init_size` value from the kernel setup header) into the `eax` register. `BP_init_size` contains the larger of the compressed and uncompressed [vmlinux](https://en.wikipedia.org/wiki/Vmlinux) sizes. Next we subtract the address of the `_end` symbol from this value and add the result to the `ebx` register, which will store the base address for kernel decompression.
|
||||
@ -377,6 +376,11 @@ To understand the magic with `gdt` offsets we need to look at the definition of
|
||||
|
||||
```assembly
|
||||
.data
|
||||
gdt64:
|
||||
.word gdt_end - gdt
|
||||
.long 0
|
||||
.word 0
|
||||
.quad 0
|
||||
gdt:
|
||||
.word gdt_end - gdt
|
||||
.long gdt
|
||||
@ -507,7 +511,7 @@ We put the base address of the page directory pointer which is `4096` or `0x1000
|
||||
leal pgtable + 0x2000(%ebx), %edi
|
||||
movl $0x00000183, %eax
|
||||
movl $2048, %ecx
|
||||
1: movl %eax, 0(%edi)
|
||||
1: movl %eax, 0(%edi)
|
||||
addl $0x00200000, %eax
|
||||
addl $8, %edi
|
||||
decl %ecx
|
||||
@ -555,7 +559,7 @@ After this we push this address to the stack and enable paging by setting `PG` a
|
||||
|
||||
```assembly
|
||||
pushl %eax
|
||||
movl $(X86_CR0_PG | X86_CR0_PE), %eax
|
||||
movl $(X86_CR0_PG | X86_CR0_PE), %eax
|
||||
movl %eax, %cr0
|
||||
```
|
||||
|
||||
|
@ -62,15 +62,21 @@ The next step is computation of difference between where the kernel was compiled
|
||||
|
||||
The `rbp` contains the decompressed kernel start address and after this code executes `rbx` register will contain address to relocate the kernel code for decompression. We already saw code like this in the `startup_32` ( you can read about it in the previous part - [Calculate relocation address](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-4.md#calculate-relocation-address)), but we need to do this calculation again because the bootloader can use 64-bit boot protocol and `startup_32` just will not be executed in this case.
|
||||
|
||||
In the next step we can see setup of the stack pointer and resetting of the flags register:
|
||||
In the next step we can see the setup of the stack pointer, the resetting of the flags register and the setup of the `GDT` again, because with the `64-bit` boot protocol the `32-bit` code segment can be omitted by the bootloader:
|
||||
|
||||
```assembly
|
||||
leaq boot_stack_end(%rbx), %rsp
|
||||
leaq boot_stack_end(%rbx), %rsp
|
||||
|
||||
leaq gdt(%rip), %rax
|
||||
movq %rax, gdt64+2(%rip)
|
||||
lgdt gdt64(%rip)
|
||||
|
||||
pushq $0
|
||||
popfq
|
||||
```
|
||||
|
||||
If you look at the Linux kernel source code after the `lgdt gdt64(%rip)` instruction, you will see that there is some additional code. This code builds a trampoline to enable [5-level paging](https://lwn.net/Articles/708526/) if needed. We will consider only 4-level paging in this book, so this code will be omitted.
|
||||
|
||||
As you can see above, the `rbx` register contains the start address of the kernel decompressor code and we just put this address with `boot_stack_end` offset to the `rsp` register which represents pointer to the top of the stack. After this step, the stack will be correct. You can find definition of the `boot_stack_end` in the end of [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S) assembly source code file:
|
||||
|
||||
```assembly
|
||||
@ -187,7 +193,7 @@ At the end, we can see the call to the `extract_kernel` function:
|
||||
popq %rsi
|
||||
```
|
||||
|
||||
Again we set `rdi` to a pointer to the `boot_params` structure and preserve it on the stack. In the same time we set `rsi` to point to the area which should be usedd for kernel uncompression. The last step is preparation of the `extract_kernel` parameters and call of this function which will uncompres the kernel. The `extract_kernel` function is defined in the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c) source code file and takes six arguments:
|
||||
Again we set `rdi` to a pointer to the `boot_params` structure and preserve it on the stack. At the same time we set `rsi` to point to the area which should be used for kernel decompression. The last step is the preparation of the `extract_kernel` parameters and the call of this function, which will decompress the kernel. The `extract_kernel` function is defined in the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c) source code file and takes six arguments:
|
||||
|
||||
* `rmode` - pointer to the [boot_params](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973//arch/x86/include/uapi/asm/bootparam.h#L114) structure which is filled by bootloader or during early kernel initialization;
|
||||
* `heap` - pointer to the `boot_heap` which represents start address of the early boot heap;
|
||||
@ -225,167 +231,45 @@ boot_heap:
|
||||
|
||||
where the `BOOT_HEAP_SIZE` is macro which expands to `0x10000` (`0x400000` in a case of `bzip2` kernel) and represents the size of the heap.
|
||||
|
||||
After heap pointers initialization, the next step is the call of the `choose_random_location` function from [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/kaslr.c#L425) source code file. As we can guess from the function name, it chooses the memory location where the kernel image will be decompressed. It may look weird that we need to find or even `choose` location where to decompress the compressed kernel image, but the Linux kernel supports [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization) which allows decompression of the kernel into a random address, for security reasons. Let's open the [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/kaslr.c#L425) source code file and look at `choose_random_location`.
|
||||
After heap pointers initialization, the next step is the call of the `choose_random_location` function from [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/kaslr.c#L425) source code file. As we can guess from the function name, it chooses the memory location where the kernel image will be decompressed. It may look weird that we need to find or even `choose` location where to decompress the compressed kernel image, but the Linux kernel supports [kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization) which allows decompression of the kernel into a random address, for security reasons.
|
||||
|
||||
First, `choose_random_location` tries to find the `nokaslr` option in the Linux kernel command line:
|
||||
We will not consider randomization of the Linux kernel load address in this part, but will do it in the next part.
|
||||
|
||||
Now let's get back to [misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c#L404). After getting the address for the kernel image, some checks are needed to be sure that the retrieved random address is correctly aligned and is not wrong:
|
||||
|
||||
```C
|
||||
if (cmdline_find_option_bool("nokaslr")) {
|
||||
debug_putstr("KASLR disabled by cmdline...\n");
|
||||
return;
|
||||
}
|
||||
if ((unsigned long)output & (MIN_KERNEL_ALIGN - 1))
|
||||
error("Destination physical address inappropriately aligned");
|
||||
|
||||
if (virt_addr & (MIN_KERNEL_ALIGN - 1))
|
||||
error("Destination virtual address inappropriately aligned");
|
||||
|
||||
if (heap > 0x3fffffffffffUL)
|
||||
error("Destination address too large");
|
||||
|
||||
if (virt_addr + max(output_len, kernel_total_size) > KERNEL_IMAGE_SIZE)
|
||||
error("Destination virtual address is beyond the kernel mapping area");
|
||||
|
||||
if ((unsigned long)output != LOAD_PHYSICAL_ADDR)
|
||||
error("Destination address does not match LOAD_PHYSICAL_ADDR");
|
||||
|
||||
if (virt_addr != LOAD_PHYSICAL_ADDR)
|
||||
error("Destination virtual address changed when not relocatable");
|
||||
```
|
||||
|
||||
and exit if the option is present.
|
||||
|
||||
For now, let's assume the kernel was configured with randomization enabled and try to understand what `kASLR` is. We can find information about it in the [documentation](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/kernel-parameters.txt):
|
||||
|
||||
```
|
||||
kaslr/nokaslr [X86]
|
||||
|
||||
Enable/disable kernel and module base offset ASLR
|
||||
(Address Space Layout Randomization) if built into
|
||||
the kernel. When CONFIG_HIBERNATION is selected,
|
||||
kASLR is disabled by default. When kASLR is enabled,
|
||||
hibernation will be disabled.
|
||||
```
|
||||
|
||||
It means that we can pass the `kaslr` option to the kernel's command line and get a random address for the decompressed kernel (you can read more about ASLR [here](https://en.wikipedia.org/wiki/Address_space_layout_randomization)). So, our current goal is to find a random address where we can `safely` decompress the Linux kernel. I repeat: `safely`. What does it mean in this context? You may remember that besides the code of the decompressor and the kernel image itself, there are some unsafe places in memory. For example, the [initrd](https://en.wikipedia.org/wiki/Initrd) image is in memory too, and we must not overlap it with the decompressed kernel.
|
||||
|
||||
The next function will help us to build identity mapping pages to avoid unsafe places in RAM and decompress the kernel. After this we should find a safe place where we can decompress the kernel. This function is `mem_avoid_init`. It is defined in the same source code [file](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/kaslr.c), and takes three arguments that we already saw in the `extract_kernel` function:
|
||||
|
||||
* `input_data` - pointer to the start of the compressed kernel, or in other words, the pointer to `arch/x86/boot/compressed/vmlinux.bin.bz2`;
|
||||
* `input_len` - the size of the compressed kernel;
|
||||
* `output` - the start address of the future decompressed kernel;
|
||||
|
||||
The main point of this function is to fill array of the `mem_vector` structures:
|
||||
|
||||
```C
|
||||
#define MEM_AVOID_MAX 5
|
||||
|
||||
static struct mem_vector mem_avoid[MEM_AVOID_MAX];
|
||||
```
|
||||
|
||||
where the `mem_vector` structure contains information about unsafe memory regions:
|
||||
|
||||
```C
|
||||
struct mem_vector {
|
||||
unsigned long start;
|
||||
unsigned long size;
|
||||
};
|
||||
```
|
||||
|
||||
The implementation of `mem_avoid_init` is pretty simple. Let's look at a part of this function:
|
||||
|
||||
```C
|
||||
...
|
||||
...
|
||||
...
|
||||
initrd_start = (u64)real_mode->ext_ramdisk_image << 32;
|
||||
initrd_start |= real_mode->hdr.ramdisk_image;
|
||||
initrd_size = (u64)real_mode->ext_ramdisk_size << 32;
|
||||
initrd_size |= real_mode->hdr.ramdisk_size;
|
||||
mem_avoid[1].start = initrd_start;
|
||||
mem_avoid[1].size = initrd_size;
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
Here we can see calculation of the [initrd](http://en.wikipedia.org/wiki/Initrd) start address and size. The `ext_ramdisk_image` is the high `32-bits` of the `ramdisk_image` field from the setup header, and `ext_ramdisk_size` is the high 32-bits of the `ramdisk_size` field from the [boot protocol](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/boot.txt):
|
||||
|
||||
```
|
||||
Offset Proto Name Meaning
|
||||
/Size
|
||||
...
|
||||
...
|
||||
...
|
||||
0218/4 2.00+ ramdisk_image initrd load address (set by boot loader)
|
||||
021C/4 2.00+ ramdisk_size initrd size (set by boot loader)
|
||||
...
|
||||
```
|
||||
|
||||
And `ext_ramdisk_image` and `ext_ramdisk_size` can be found in the [Documentation/x86/zero-page.txt](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/zero-page.txt):
|
||||
|
||||
```
|
||||
Offset Proto Name Meaning
|
||||
/Size
|
||||
...
|
||||
...
|
||||
...
|
||||
0C0/004 ALL ext_ramdisk_image ramdisk_image high 32bits
|
||||
0C4/004 ALL ext_ramdisk_size ramdisk_size high 32bits
|
||||
...
|
||||
```
|
||||
|
||||
So we take `ext_ramdisk_image` and `ext_ramdisk_size`, shift them left by `32` bits (so that they occupy the high 32 bits of a 64-bit value), OR in the low 32 bits from `ramdisk_image` and `ramdisk_size`, and get the start address and size of the `initrd`. After this we store these values in the `mem_avoid` array.
|
||||
|
||||
The next step after we've collected all unsafe memory regions in the `mem_avoid` array will be searching for a random address that does not overlap with the unsafe regions, using the `find_random_phys_addr` function.
|
||||
|
||||
First of all we can see the alignment of the output address in the `find_random_phys_addr` function:
|
||||
|
||||
```C
|
||||
minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);
|
||||
```
|
||||
|
||||
You can remember `CONFIG_PHYSICAL_ALIGN` configuration option from the previous part. This option provides the value to which kernel should be aligned and it is `0x200000` by default. Once we have the aligned output address, we go through the memory regions which we got with the help of the BIOS [e820](https://en.wikipedia.org/wiki/E820) service and collect regions suitable for the decompressed kernel image:
|
||||
|
||||
```C
|
||||
process_e820_entries(minimum, image_size);
|
||||
```
|
||||
|
||||
Recall that we collected `e820_entries` in the second part of the [Kernel booting process part 2](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-2.md#memory-detection). The `process_e820_entries` function does some checks that an `e820` memory region is not `non-RAM`, that the start address of the memory region is not bigger than maximum allowed `aslr` offset, and that the memory region is above the minimum load location:
|
||||
|
||||
```C
|
||||
for (i = 0; i < boot_params->e820_entries; i++) {
|
||||
...
|
||||
...
|
||||
...
|
||||
process_mem_region(&region, minimum, image_size);
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
and calls `process_mem_region` for acceptable memory regions. The `process_mem_region` function processes the given memory region and stores suitable memory regions in the `slot_areas` array of `slot_area` structures, which are defined as:
|
||||
|
||||
```C
|
||||
#define MAX_SLOT_AREA 100
|
||||
|
||||
static struct slot_area slot_areas[MAX_SLOT_AREA];
|
||||
|
||||
struct slot_area {
|
||||
unsigned long addr;
|
||||
int num;
|
||||
};
|
||||
```
|
||||
|
||||
After the `process_mem_region` is done, we will have an array of addresses that are safe for the decompressed kernel. Then we call `slots_fetch_random` function to get a random item from this array:
|
||||
|
||||
```C
|
||||
slot = kaslr_get_random_long("Physical") % slot_max;
|
||||
|
||||
for (i = 0; i < slot_area_index; i++) {
|
||||
if (slot >= slot_areas[i].num) {
|
||||
slot -= slot_areas[i].num;
|
||||
continue;
|
||||
}
|
||||
return slot_areas[i].addr + slot * CONFIG_PHYSICAL_ALIGN;
|
||||
}
|
||||
```
|
||||
|
||||
where the `kaslr_get_random_long` function checks different CPU flags such as `X86_FEATURE_RDRAND` or `X86_FEATURE_TSC` and chooses a method for getting a random number (it can be the RDRAND instruction, the time stamp counter, the programmable interval timer, etc...). After retrieving the random address, execution of `choose_random_location` is finished.
|
||||
|
||||
Now let's get back to [misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c#L404). After getting the address for the kernel image, some checks are needed to be sure that the retrieved random address is correctly aligned and is not wrong.
|
||||
|
||||
After all these checks we will see the familiar message:
|
||||
|
||||
```
|
||||
Decompressing Linux...
|
||||
```
|
||||
|
||||
and call the `__decompress` function which will decompress the kernel. The `__decompress` function depends on what decompression algorithm was chosen during kernel compilation:
|
||||
and call the `__decompress` function:
|
||||
|
||||
```C
|
||||
__decompress(input_data, input_len, NULL, NULL, output, output_len, NULL, error);
|
||||
```
|
||||
|
||||
which will decompress the kernel. The implementation of the `__decompress` function depends on what decompression algorithm was chosen during kernel compilation:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_KERNEL_GZIP
|
||||
@ -444,15 +328,15 @@ Elf64_Phdr *phdrs, *phdr;
|
||||
memcpy(&ehdr, output, sizeof(ehdr));
|
||||
|
||||
if (ehdr.e_ident[EI_MAG0] != ELFMAG0 ||
|
||||
ehdr.e_ident[EI_MAG1] != ELFMAG1 ||
|
||||
ehdr.e_ident[EI_MAG2] != ELFMAG2 ||
|
||||
ehdr.e_ident[EI_MAG3] != ELFMAG3) {
|
||||
error("Kernel is not a valid ELF file");
|
||||
return;
|
||||
ehdr.e_ident[EI_MAG1] != ELFMAG1 ||
|
||||
ehdr.e_ident[EI_MAG2] != ELFMAG2 ||
|
||||
ehdr.e_ident[EI_MAG3] != ELFMAG3) {
|
||||
error("Kernel is not a valid ELF file");
|
||||
return;
|
||||
}
|
||||
```
|
||||
|
||||
and if it's not valid, it prints an error message and halts. If we got a valid `ELF` file, we go through all program headers from the given `ELF` file and copy all loadable segments with correct address to the output buffer:
|
||||
and if it's not valid, it prints an error message and halts. If we got a valid `ELF` file, we go through all program headers from the given `ELF` file and copy all loadable segments with correct 2 megabytes aligned address to the output buffer:
|
||||
|
||||
```C
|
||||
for (i = 0; i < ehdr.e_phnum; i++) {
|
||||
@ -460,24 +344,33 @@ and if it's not valid, it prints an error message and halts. If we got a valid `
|
||||
|
||||
switch (phdr->p_type) {
|
||||
case PT_LOAD:
|
||||
#ifdef CONFIG_X86_64
|
||||
if ((phdr->p_align % 0x200000) != 0)
|
||||
error("Alignment of LOAD segment isn't multiple of 2MB");
|
||||
+#endif
|
||||
#ifdef CONFIG_RELOCATABLE
|
||||
dest = output;
|
||||
dest += (phdr->p_paddr - LOAD_PHYSICAL_ADDR);
|
||||
#else
|
||||
dest = (void *)(phdr->p_paddr);
|
||||
#endif
|
||||
memcpy(dest,
|
||||
output + phdr->p_offset,
|
||||
phdr->p_filesz);
|
||||
memmove(dest, output + phdr->p_offset, phdr->p_filesz);
|
||||
break;
|
||||
default:
|
||||
break;
|
||||
default: /* Ignore other PT_* */ break;
|
||||
}
|
||||
}
|
||||
```
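The destination arithmetic for the `CONFIG_RELOCATABLE` case can be tried out in isolation. The following is only a minimal sketch: the helper name `demo_segment_dest` and its parameters are hypothetical and stand in for `output`, `phdr->p_paddr` and `LOAD_PHYSICAL_ADDR`, with plain integers used instead of pointers:

```C
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: where a loadable segment ends up when the
 * kernel is relocatable. The segment keeps its offset relative to
 * the address the kernel was linked to load at (`load_phys_addr`),
 * but that offset is applied to the real output buffer.
 */
static uint64_t demo_segment_dest(uint64_t output, uint64_t p_paddr,
                                  uint64_t load_phys_addr)
{
    /* Assumption of this sketch: segments never start below the link base. */
    assert(p_paddr >= load_phys_addr);
    return output + (p_paddr - load_phys_addr);
}
```

For example, a segment linked at `0x1200000` with a link base of `0x1000000` lands at `0x3200000` when the output buffer starts at `0x3000000`.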
|
||||
|
||||
That's all. From now on, all loadable segments are in the correct place. Implementation of the last `handle_relocations` function depends on the `CONFIG_X86_NEED_RELOCS` kernel configuration option and if it is enabled, this function adjusts addresses in the kernel image, and is called only if the `kASLR` was enabled during kernel configuration.
|
||||
That's all.
|
||||
|
||||
After the kernel is relocated, we return back from the `extract_kernel` to [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S). The address of the kernel will be in the `rax` register and we jump to it:
|
||||
From this moment, all loadable segments are in the correct place.
|
||||
|
||||
The next step after the `parse_elf` function is the call of the `handle_relocations` function. The implementation of this function depends on the `CONFIG_X86_NEED_RELOCS` kernel configuration option. If it is enabled, this function adjusts addresses in the kernel image, and it is called only if the `CONFIG_RANDOMIZE_BASE` configuration option was enabled during kernel configuration. The implementation of the `handle_relocations` function is easy enough. It subtracts the value of `LOAD_PHYSICAL_ADDR` from the base load address of the kernel, which gives the difference between the address the kernel was linked to load at and the address where it was actually loaded. After this we can relocate the kernel, because we know the actual address where the kernel was loaded, the address it was linked to run at, and the relocation table, which is at the end of the kernel image.
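The idea behind this step can be shown with a heavily simplified sketch. It does not follow the real relocation table format: the names `demo_reloc_delta` and `demo_apply_relocs` are hypothetical, and the "table" here is just an array of absolute addresses that all need the same adjustment:

```C
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Difference between where the image was loaded and where it was linked. */
static uint64_t demo_reloc_delta(uint64_t loaded_at, uint64_t linked_at)
{
    /* Unsigned arithmetic wraps, so a lower load address also works. */
    return loaded_at - linked_at;
}

/* Adds the same delta to every absolute address in `values`. */
static void demo_apply_relocs(uint64_t *values, size_t count, uint64_t delta)
{
    size_t i;

    assert(values != NULL || count == 0);

    for (i = 0; i < count; i++)
        values[i] += delta;
}
```

With a link address of `0x1000000` and an actual load address of `0x3000000` the delta is `0x2000000`, so an absolute address `0x1000010` stored in the image becomes `0x3000010`.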
|
||||
|
||||
After the kernel is relocated, we return back from the `extract_kernel` to [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S).
|
||||
|
||||
The address of the kernel will be in the `rax` register and we jump to it:
|
||||
|
||||
```assembly
|
||||
jmp *%rax
|
||||
@ -488,9 +381,9 @@ That's all. Now we are in the kernel!
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the fifth and the last part about linux kernel booting process. We will not see posts about kernel booting anymore (maybe updates to this and previous posts), but there will be many posts about other kernel internals.
|
||||
This is the end of the fifth part about linux kernel booting process. We will not see posts about kernel booting anymore (maybe updates to this and previous posts), but there will be many posts about other kernel internals.
|
||||
|
||||
Next chapter will be about kernel initialization and we will see the first steps in the Linux kernel initialization code.
|
||||
The next chapter will describe more advanced details of the Linux kernel booting process, such as load address randomization.
|
||||
|
||||
If you have any questions or suggestions write me a comment or ping me in [twitter](https://twitter.com/0xAX).
|
||||
|
||||
@ -500,10 +393,10 @@ Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization)
|
||||
* [initrd](http://en.wikipedia.org/wiki/Initrd)
|
||||
* [long mode](http://en.wikipedia.org/wiki/Long_mode)
|
||||
* [initrd](https://en.wikipedia.org/wiki/Initrd)
|
||||
* [long mode](https://en.wikipedia.org/wiki/Long_mode)
|
||||
* [bzip2](http://www.bzip.org/)
|
||||
* [RDdRand instruction](http://en.wikipedia.org/wiki/RdRand)
|
||||
* [Time Stamp Counter](http://en.wikipedia.org/wiki/Time_Stamp_Counter)
|
||||
* [Programmable Interval Timers](http://en.wikipedia.org/wiki/Intel_8253)
|
||||
* [RdRand instruction](https://en.wikipedia.org/wiki/RdRand)
|
||||
* [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
|
||||
* [Programmable Interval Timers](https://en.wikipedia.org/wiki/Intel_8253)
|
||||
* [Previous part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-4.md)
|
||||
|
421
Booting/linux-bootstrap-6.md
Normal file
@ -0,0 +1,421 @@
|
||||
Kernel booting process. Part 6.
|
||||
================================================================================
|
||||
|
||||
Introduction
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the sixth part of the `Kernel booting process` series. In the [previous part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-5.md) we have seen the end of the kernel boot process. But we have skipped some important advanced parts.
|
||||
|
||||
As you may remember, the entry point of the Linux kernel is the `start_kernel` function from the [main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file, and the kernel image starts to execute at the `LOAD_PHYSICAL_ADDR` address. This address depends on the `CONFIG_PHYSICAL_START` kernel configuration option which is `0x1000000` by default:
|
||||
|
||||
```
|
||||
config PHYSICAL_START
|
||||
hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
|
||||
default "0x1000000"
|
||||
---help---
|
||||
This gives the physical address where the kernel is loaded.
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
This value may be changed during kernel configuration, but the load address can also be selected as a random value. For this purpose, the `CONFIG_RANDOMIZE_BASE` kernel configuration option should be enabled during kernel configuration.
|
||||
|
||||
In this case, the physical address at which the Linux kernel image will be decompressed and loaded will be randomized. This part considers the case when this option is enabled and the load address of the kernel image is randomized for [security reasons](https://en.wikipedia.org/wiki/Address_space_layout_randomization).
|
||||
|
||||
Initialization of page tables
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Before the kernel decompressor starts to look for a random memory range where the kernel will be decompressed and loaded, the identity mapped page tables should be initialized. If a [bootloader](https://en.wikipedia.org/wiki/Booting) used the [16-bit or 32-bit boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt), we already have page tables. But in any case, we may need new pages on demand if the kernel decompressor selects a memory range outside of them. That's why we need to build new identity mapped page tables.
|
||||
|
||||
Yes, building identity mapped page tables is one of the first steps during randomization of the load address. But before we consider it, let's try to remember how we got to this point.
|
||||
|
||||
In the [previous part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-5.md), we saw the transition to [long mode](https://en.wikipedia.org/wiki/Long_mode) and the jump to the kernel decompressor entry point, the `extract_kernel` function. The randomization stuff starts here from the call of the:
|
||||
|
||||
```C
|
||||
void choose_random_location(unsigned long input,
|
||||
unsigned long input_size,
|
||||
unsigned long *output,
|
||||
unsigned long output_size,
|
||||
unsigned long *virt_addr)
|
||||
{}
|
||||
```
|
||||
|
||||
function. As you may see, this function takes the following five parameters:
|
||||
|
||||
* `input`;
|
||||
* `input_size`;
|
||||
* `output`;
|
||||
* `output_size`;
|
||||
* `virt_addr`.
|
||||
|
||||
Let's try to understand what these parameters are. The first parameter, `input`, comes from the parameters of the `extract_kernel` function from the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c) source code file:
|
||||
|
||||
```C
|
||||
asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
|
||||
unsigned char *input_data,
|
||||
unsigned long input_len,
|
||||
unsigned char *output,
|
||||
unsigned long output_len)
|
||||
{
|
||||
...
|
||||
...
|
||||
...
|
||||
choose_random_location((unsigned long)input_data, input_len,
|
||||
(unsigned long *)&output,
|
||||
max(output_len, kernel_total_size),
|
||||
&virt_addr);
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
This parameter is passed from assembler code:
|
||||
|
||||
```assembly
|
||||
leaq input_data(%rip), %rdx
|
||||
```
|
||||
|
||||
from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S). The `input_data` is generated by the little [mkpiggy](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/mkpiggy.c) program. If you have compiled the Linux kernel source code yourself, you may find the file generated by this program at `linux/arch/x86/boot/compressed/piggy.S`. In my case this file looks like this:
|
||||
|
||||
```assembly
|
||||
.section ".rodata..compressed","a",@progbits
|
||||
.globl z_input_len
|
||||
z_input_len = 6988196
|
||||
.globl z_output_len
|
||||
z_output_len = 29207032
|
||||
.globl input_data, input_data_end
|
||||
input_data:
|
||||
.incbin "arch/x86/boot/compressed/vmlinux.bin.gz"
|
||||
input_data_end:
|
||||
```
|
||||
|
||||
As you may see, it contains four global symbols. The first two, `z_input_len` and `z_output_len`, are the sizes of the compressed and uncompressed `vmlinux.bin.gz`. The third is our `input_data`, and as you may see, it points to the Linux kernel image in raw binary format (all debugging symbols, comments and relocation information are stripped). And the last, `input_data_end`, points to the end of the compressed Linux image.
|
||||
|
||||
So, our first parameter of the `choose_random_location` function is the pointer to the compressed kernel image that is embedded into the `piggy.o` object file.
|
||||
|
||||
The second parameter of the `choose_random_location` function is the `z_input_len` that we have seen just now.
|
||||
|
||||
The third and fourth parameters of the `choose_random_location` function are the address where the decompressed kernel image will be placed and the length of the decompressed kernel image, respectively. The address where to put the decompressed kernel comes from [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) and it is the address of `startup_32` aligned to a 2 megabyte boundary. The size of the decompressed kernel comes from the same `piggy.S` and it is `z_output_len`.
|
||||
|
||||
The last parameter of the `choose_random_location` function is the virtual address at which the kernel will be loaded. As we may see, by default it coincides with the default physical load address:
|
||||
|
||||
```C
|
||||
unsigned long virt_addr = LOAD_PHYSICAL_ADDR;
|
||||
```
|
||||
|
||||
which depends on kernel configuration:
|
||||
|
||||
```C
|
||||
#define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \
|
||||
+ (CONFIG_PHYSICAL_ALIGN - 1)) \
|
||||
& ~(CONFIG_PHYSICAL_ALIGN - 1))
|
||||
```
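To make the rounding concrete, here is a minimal sketch of the same computation as a plain C function. The function name `load_physical_addr` and its parameters are only for illustration (they stand in for the two `CONFIG_*` options), and the alignment is assumed to be a power of two:

```C
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: rounds `start` up to the next multiple of
 * `align`, the same arithmetic as the LOAD_PHYSICAL_ADDR macro.
 */
static uint64_t load_physical_addr(uint64_t start, uint64_t align)
{
    /* The mask trick below only works for powers of two. */
    assert(align != 0 && (align & (align - 1)) == 0);
    return (start + (align - 1)) & ~(align - 1);
}
```

With the default start of `0x1000000` and an assumed alignment of `0x200000`, the result stays `0x1000000` because the start is already aligned, while a start of `0x1000001` would be rounded up to `0x1200000`.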
|
||||
|
||||
Now that we have considered the parameters of the `choose_random_location` function, let's look at its implementation. This function starts by checking the `nokaslr` option in the kernel command line:
|
||||
|
||||
```C
|
||||
if (cmdline_find_option_bool("nokaslr")) {
|
||||
warn("KASLR disabled: 'nokaslr' on cmdline.");
|
||||
return;
|
||||
}
|
||||
```
|
||||
|
||||
and if the option was given, we exit from the `choose_random_location` function and the kernel load address will not be randomized. Related command line options can be found in the [kernel documentation](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/kernel-parameters.txt):
|
||||
|
||||
```
|
||||
kaslr/nokaslr [X86]
|
||||
|
||||
Enable/disable kernel and module base offset ASLR
|
||||
(Address Space Layout Randomization) if built into
|
||||
the kernel. When CONFIG_HIBERNATION is selected,
|
||||
kASLR is disabled by default. When kASLR is enabled,
|
||||
hibernation will be disabled.
|
||||
```
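The implementation of `cmdline_find_option_bool` is not shown in this part. The following is only a simplified, hypothetical stand-in (the name `cmdline_has_option` is made up for this sketch) that shows what such a boolean lookup has to do: find the option as a whole word in a space-separated command line.

```C
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical, simplified boolean option lookup: returns true if
 * `option` appears in `cmdline` as a whole space-separated word.
 */
static bool cmdline_has_option(const char *cmdline, const char *option)
{
    size_t len = strlen(option);
    const char *p = cmdline;

    assert(len > 0);

    while ((p = strstr(p, option)) != NULL) {
        bool starts_word = (p == cmdline) || (p[-1] == ' ');
        bool ends_word = (p[len] == '\0') || (p[len] == ' ');

        if (starts_word && ends_word)
            return true;
        p++;    /* keep searching after this partial match */
    }
    return false;
}
```

The whole-word check matters here: without it, a search for `kaslr` would also match inside `nokaslr` and give the opposite answer.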
|
||||
|
||||
Let's assume that we didn't pass `nokaslr` to the kernel command line and the `CONFIG_RANDOMIZE_BASE` kernel configuration option is enabled. In this case we add the `kASLR` flag to the kernel load flags:
|
||||
|
||||
```C
|
||||
boot_params->hdr.loadflags |= KASLR_FLAG;
|
||||
```
|
||||
|
||||
and the next step is the call of the:
|
||||
|
||||
```C
|
||||
initialize_identity_maps();
|
||||
```
|
||||
|
||||
function which is defined in the [arch/x86/boot/compressed/kaslr_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr_64.c) source code file. This function starts with the initialization of `mapping_info`, an instance of the `x86_mapping_info` structure:
|
||||
|
||||
```C
|
||||
mapping_info.alloc_pgt_page = alloc_pgt_page;
|
||||
mapping_info.context = &pgt_data;
|
||||
mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC | sev_me_mask;
|
||||
mapping_info.kernpg_flag = _KERNPG_TABLE;
|
||||
```
|
||||
|
||||
The `x86_mapping_info` structure is defined in the [arch/x86/include/asm/init.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/init.h) header file and looks like this:
|
||||
|
||||
```C
|
||||
struct x86_mapping_info {
|
||||
void *(*alloc_pgt_page)(void *);
|
||||
void *context;
|
||||
unsigned long page_flag;
|
||||
unsigned long offset;
|
||||
bool direct_gbpages;
|
||||
unsigned long kernpg_flag;
|
||||
};
|
||||
```
|
||||
|
||||
This structure provides information about memory mappings. As you may remember from the previous part, we have already set up initial page tables covering 0 up to `4G`. Now we may need to access memory above `4G` to load the kernel at a random position. So, the `initialize_identity_maps` function initializes a memory region for new page tables that may be needed. First of all, let's look at the definition of the `x86_mapping_info` structure.
|
||||
|
||||
The `alloc_pgt_page` is a callback function that will be called to allocate space for a page table entry. The `context` field is, in our case, an instance of the `alloc_pgt_data` structure which will be used to track allocated page tables. The `page_flag` and `kernpg_flag` fields are page flags. The first represents flags for `PMD` or `PUD` entries. The second, `kernpg_flag`, represents flags for kernel pages which can be overridden later. The `direct_gbpages` field represents support for huge pages, and the last field, `offset`, represents the offset between kernel virtual addresses and physical addresses up to the `PMD` level.
|
||||
|
||||
The `alloc_pgt_page` callback just checks that there is space for a new page, allocates a new page:
|
||||
|
||||
```C
|
||||
entry = pages->pgt_buf + pages->pgt_buf_offset;
|
||||
pages->pgt_buf_offset += PAGE_SIZE;
|
||||
```
|
||||
|
||||
in the buffer from the:
|
||||
|
||||
```C
|
||||
struct alloc_pgt_data {
|
||||
unsigned char *pgt_buf;
|
||||
unsigned long pgt_buf_size;
|
||||
unsigned long pgt_buf_offset;
|
||||
};
|
||||
```
|
||||
|
||||
structure and returns the address of the new page. The last goal of the `initialize_identity_maps` function is to initialize `pgt_buf_size` and `pgt_buf_offset`. As we are only in the initialization phase, the `initialize_identity_maps` function sets `pgt_buf_offset` to zero:
|
||||
|
||||
```C
|
||||
pgt_data.pgt_buf_offset = 0;
|
||||
```
|
||||
|
||||
and the `pgt_data.pgt_buf_size` will be set to `77824` or `69632` depending on which boot protocol was used by the bootloader (64-bit or 32-bit). The same is true for `pgt_data.pgt_buf`. If a bootloader loaded the kernel at `startup_32`, the `pgt_data.pgt_buf` will point to the end of the page table which was already initialized in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):
|
||||
|
||||
```C
|
||||
pgt_data.pgt_buf = _pgtable + BOOT_INIT_PGT_SIZE;
|
||||
```
|
||||
|
||||
where `_pgtable` points to the beginning of this page table [_pgtable](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/vmlinux.lds.S). Otherwise, if a bootloader used the 64-bit boot protocol and loaded the kernel at `startup_64`, early page tables should have been built by the bootloader itself and `_pgtable` will just be overwritten:
|
||||
|
||||
```C
|
||||
pgt_data.pgt_buf = _pgtable;
|
||||
```
|
||||
|
||||
As the buffer for new page tables is initialized, we may return back to the `choose_random_location` function.
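The allocation scheme described in this section is what is usually called a bump allocator: pages are handed out one after another from a fixed buffer and never freed. Here is a minimal sketch of that idea. The `demo_` names and the fixed 4096 byte page size are assumptions of this sketch, not the kernel's definitions:

```C
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL

/* Hypothetical mirror of the bookkeeping kept in `alloc_pgt_data`. */
struct demo_pgt_data {
    unsigned char *buf;     /* start of the page table buffer */
    unsigned long size;     /* total size of the buffer in bytes */
    unsigned long offset;   /* offset of the next free page */
};

/*
 * Hands out the next page from the buffer, or NULL when the buffer
 * is exhausted.
 */
static void *demo_alloc_page(struct demo_pgt_data *pages)
{
    unsigned char *entry;

    assert(pages != NULL && pages->buf != NULL);

    /* Check that there is still room for one more page. */
    if (pages->offset + DEMO_PAGE_SIZE > pages->size)
        return NULL;

    entry = pages->buf + pages->offset;
    pages->offset += DEMO_PAGE_SIZE;
    return entry;
}
```

Resetting `offset` to zero, as the initialization code does, is all that is needed to make the whole buffer available again.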
|
||||
|
||||
Avoid reserved memory ranges
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
After the stuff related to identity page tables is initialized, we may start to choose a random location where to put the decompressed kernel image. But as you may guess, we can't choose just any address. There are some reserved addresses in memory ranges. Such addresses are occupied by important things, like the [initrd](https://en.wikipedia.org/wiki/Initial_ramdisk), the kernel command line, etc. The
|
||||
|
||||
```C
|
||||
mem_avoid_init(input, input_size, *output);
|
||||
```
|
||||
|
||||
function will help us to do this. All non-safe memory regions will be collected in the:
|
||||
|
||||
```C
|
||||
struct mem_vector {
|
||||
unsigned long long start;
|
||||
unsigned long long size;
|
||||
};
|
||||
|
||||
static struct mem_vector mem_avoid[MEM_AVOID_MAX];
|
||||
```
|
||||
|
||||
array, where `MEM_AVOID_MAX` comes from the `mem_avoid_index` [enum](https://en.wikipedia.org/wiki/Enumerated_type#C) which represents different types of reserved memory regions:
|
||||
|
||||
```C
|
||||
enum mem_avoid_index {
|
||||
MEM_AVOID_ZO_RANGE = 0,
|
||||
MEM_AVOID_INITRD,
|
||||
MEM_AVOID_CMDLINE,
|
||||
MEM_AVOID_BOOTPARAMS,
|
||||
MEM_AVOID_MEMMAP_BEGIN,
|
||||
MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
|
||||
MEM_AVOID_MAX,
|
||||
};
|
||||
```
|
||||
|
||||
Both are defined in the [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c) source code file.
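This part does not show how a candidate region is later tested against the `mem_avoid` array. One plausible building block is a plain interval overlap test, sketched below. The `demo_` names are hypothetical and the structure only mirrors the layout of `mem_vector`:

```C
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same layout as the `mem_vector` structure shown above. */
struct demo_mem_vector {
    unsigned long long start;
    unsigned long long size;
};

/* Returns true if the half-open ranges [start, start + size) intersect. */
static bool demo_mem_overlaps(const struct demo_mem_vector *one,
                              const struct demo_mem_vector *two)
{
    assert(one != NULL && two != NULL);

    /* `one` ends before `two` begins. */
    if (one->start + one->size <= two->start)
        return false;
    /* `one` begins after `two` ends. */
    if (one->start >= two->start + two->size)
        return false;
    return true;
}
```

Note that two regions that merely touch, such as `[0x1000, 0x2000)` and `[0x2000, 0x3000)`, do not count as overlapping in this sketch.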
|
||||
|
||||
Let's look at the implementation of the `mem_avoid_init` function. The main goal of this function is to store information about reserved memory regions described by the `mem_avoid_index` enum in the `mem_avoid` array and create new pages for such regions in our new identity mapped buffer. Numerous parts of the `mem_avoid_init` function are similar, so let's take a look at one of them:
|
||||
|
||||
```C
|
||||
mem_avoid[MEM_AVOID_ZO_RANGE].start = input;
|
||||
mem_avoid[MEM_AVOID_ZO_RANGE].size = (output + init_size) - input;
|
||||
add_identity_map(mem_avoid[MEM_AVOID_ZO_RANGE].start,
|
||||
mem_avoid[MEM_AVOID_ZO_RANGE].size);
|
||||
```
|
||||
|
||||
At the beginning, the `mem_avoid_init` function tries to avoid the memory region that is used for the current kernel decompression. We fill an entry of the `mem_avoid` array with the start and size of this region and call the `add_identity_map` function which should build identity mapped pages for this region. The `add_identity_map` function is defined in the [arch/x86/boot/compressed/kaslr_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr_64.c) source code file and looks like this:
|
||||
|
||||
```C
|
||||
void add_identity_map(unsigned long start, unsigned long size)
|
||||
{
|
||||
unsigned long end = start + size;
|
||||
|
||||
start = round_down(start, PMD_SIZE);
|
||||
end = round_up(end, PMD_SIZE);
|
||||
if (start >= end)
|
||||
return;
|
||||
|
||||
kernel_ident_mapping_init(&mapping_info, (pgd_t *)top_level_pgt,
|
||||
start, end);
|
||||
}
|
||||
```
|
||||
|
||||
As you may see, it aligns the memory region to a 2 megabyte boundary and checks the given start and end addresses.
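The effect of this alignment is easy to check by hand. The sketch below re-implements the two rounding helpers for illustration only; the `demo_` names are hypothetical, `DEMO_PMD_SIZE` is assumed to be 2 megabytes, and the alignment must be a power of two:

```C
#include <assert.h>
#include <stdint.h>

#define DEMO_PMD_SIZE (1ULL << 21)  /* 2 megabytes */

/* Rounds `x` down to a multiple of `align` (a power of two). */
static uint64_t demo_round_down(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return x & ~(align - 1);
}

/* Rounds `x` up to a multiple of `align` (a power of two). */
static uint64_t demo_round_up(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}
```

For a region that starts at `0x1234567` and is `0x1000` bytes long, the start is rounded down to `0x1200000` and the end (`0x1235567`) is rounded up to `0x1400000`, so a whole 2 megabyte page is mapped for it.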
|
||||
|
||||
In the end it just calls the `kernel_ident_mapping_init` function from the [arch/x86/mm/ident_map.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/ident_map.c) source code file and passes the `mapping_info` instance that was initialized above, the address of the top level page table and the addresses of the memory region for which a new identity mapping should be built.
|
||||
|
||||
The `kernel_ident_mapping_init` function sets default flags for new pages if they were not given:
|
||||
|
||||
```C
|
||||
if (!info->kernpg_flag)
|
||||
info->kernpg_flag = _KERNPG_TABLE;
|
||||
```
|
||||
|
||||
and starts to build new 2-megabytes (because of `PSE` bit in the `mapping_info.page_flag`) page entries (`PGD -> P4D -> PUD -> PMD` in a case of [five-level page tables](https://lwn.net/Articles/717293/) or `PGD -> PUD -> PMD` in a case of [four-level page tables](https://lwn.net/Articles/117749/)) related to the given addresses.
|
||||
|
||||
```C
|
||||
for (; addr < end; addr = next) {
|
||||
p4d_t *p4d;
|
||||
|
||||
next = (addr & PGDIR_MASK) + PGDIR_SIZE;
|
||||
if (next > end)
|
||||
next = end;
|
||||
|
||||
p4d = (p4d_t *)info->alloc_pgt_page(info->context);
|
||||
result = ident_p4d_init(info, p4d, addr, next);
|
||||
|
||||
if (result)
    return result;
|
||||
}
|
||||
```
|
||||
|
||||
First of all, here we find the next entry of the `Page Global Directory` for the given address, and if it is greater than the `end` of the given memory region, we set it to `end`. After this we allocate a new page with our `x86_mapping_info` callback that we already considered above and call the `ident_p4d_init` function. The `ident_p4d_init` function will do the same, but for the lower-level page directories (`p4d` -> `pud` -> `pmd`).
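The following sketch shows the address arithmetic behind such a walk. It assumes four-level paging with 512 entries per table; the `DEMO_` constants and function names are made up for this example and are not the kernel's macros:

```C
#include <assert.h>
#include <stdint.h>

#define DEMO_ENTRIES     512ULL     /* entries per page table */
#define DEMO_PMD_SHIFT   21         /* one PMD entry maps 2 megabytes */
#define DEMO_PUD_SHIFT   30         /* one PUD entry maps 1 gigabyte */
#define DEMO_PGDIR_SHIFT 39         /* one PGD entry maps 512 gigabytes */
#define DEMO_PGDIR_SIZE  (1ULL << DEMO_PGDIR_SHIFT)
#define DEMO_PGDIR_MASK  (~(DEMO_PGDIR_SIZE - 1))

/* Index of `addr` inside the table whose entries cover 2^shift bytes. */
static unsigned int demo_table_index(uint64_t addr, unsigned int shift)
{
    assert(shift < 64);
    return (unsigned int)((addr >> shift) & (DEMO_ENTRIES - 1));
}

/* End of the chunk covered by the current PGD entry, clamped to `end`. */
static uint64_t demo_pgd_next(uint64_t addr, uint64_t end)
{
    uint64_t next = (addr & DEMO_PGDIR_MASK) + DEMO_PGDIR_SIZE;

    return next > end ? end : next;
}
```

For the address `0x40200000` (1 gigabyte plus 2 megabytes) the indices are 0 in the PGD, 1 in the PUD and 1 in the PMD. Clamping `next` to `end` is what keeps the loop from mapping more than the requested region.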
|
||||
|
||||
That's all.
|
||||
|
||||
New page entries related to reserved addresses are in our page tables. This is not the end of the `mem_avoid_init` function, but the other parts are similar. They just build pages for the [initrd](https://en.wikipedia.org/wiki/Initial_ramdisk), the kernel command line, etc.
|
||||
|
||||
Now we may return to the `choose_random_location` function.
|
||||
|
||||
Physical address randomization
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
After the reserved memory regions have been stored in the `mem_avoid` array and identity mapped pages have been built for them, we select the minimal available address for the random memory region in which to decompress the kernel:
|
||||
|
||||
```C
|
||||
min_addr = min(*output, 512UL << 20);
|
||||
```
|
||||
|
||||
As you may see, it will be no greater than `512` megabytes. This `512` megabyte value was selected just to avoid unknown things in lower memory.
|
||||
|
||||
The next step is to select random physical and virtual addresses to load the kernel. The first is the physical address:
|
||||
|
||||
```C
|
||||
random_addr = find_random_phys_addr(min_addr, output_size);
|
||||
```
|
||||
|
||||
The `find_random_phys_addr` function is defined in the [same](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c) source code file:
|
||||
|
||||
```C
|
||||
static unsigned long find_random_phys_addr(unsigned long minimum,
|
||||
unsigned long image_size)
|
||||
{
|
||||
minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);
|
||||
|
||||
if (process_efi_entries(minimum, image_size))
|
||||
return slots_fetch_random();
|
||||
|
||||
process_e820_entries(minimum, image_size);
|
||||
return slots_fetch_random();
|
||||
}
|
||||
```
|
||||
|
||||
The main goal of the `process_efi_entries` function is to find all suitable memory ranges in fully accessible memory to load the kernel. If the kernel is compiled and run on a system without [EFI](https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface) support, we continue to search for such memory regions in the [e820](https://en.wikipedia.org/wiki/E820) regions. All memory regions that are found will be stored in the
|
||||
|
||||
```C
|
||||
struct slot_area {
|
||||
unsigned long addr;
|
||||
int num;
|
||||
};
|
||||
|
||||
#define MAX_SLOT_AREA 100
|
||||
|
||||
static struct slot_area slot_areas[MAX_SLOT_AREA];
|
||||
```
|
||||
|
||||
array. The kernel decompressor should select a random index of this array, and it will be the random place where the kernel will be decompressed. This selection is performed by the `slots_fetch_random` function. The main goal of the `slots_fetch_random` function is to select a random memory range from the `slot_areas` array via the `kaslr_get_random_long` function:
|
||||
|
||||
```C
|
||||
slot = kaslr_get_random_long("Physical") % slot_max;
|
||||
```
|
||||
|
||||
The `kaslr_get_random_long` function is defined in the [arch/x86/lib/kaslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/lib/kaslr.c) source code file and it just returns a random number. Note that the random number is obtained in different ways depending on the kernel configuration and system capabilities (for example, based on the [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter), [rdrand](https://en.wikipedia.org/wiki/RdRand) and so on).
|
||||
|
||||
That's all. From this point, a random memory range has been selected.
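How a single random slot number turns into an address can be sketched as follows. The walk mirrors the `slots_fetch_random` loop, but the `demo_` names are hypothetical, and the slot number is passed in as a parameter so that the example stays deterministic:

```C
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of `slot_area`: `num` aligned slots from `addr`. */
struct demo_slot_area {
    uint64_t addr;
    unsigned long num;
};

/*
 * Maps a global slot number to an address by walking the areas in
 * order. Returns 0 if `slot` is beyond the last area.
 */
static uint64_t demo_slot_to_addr(const struct demo_slot_area *areas,
                                  size_t count, unsigned long slot,
                                  uint64_t align)
{
    size_t i;

    assert(areas != NULL || count == 0);

    for (i = 0; i < count; i++) {
        /* Not in this area: skip all of its slots. */
        if (slot >= areas[i].num) {
            slot -= areas[i].num;
            continue;
        }
        return areas[i].addr + slot * align;
    }
    return 0;
}
```

With two areas, one holding 3 slots at `0x1000000` and one holding 2 slots at `0x4000000`, and an assumed alignment of `0x200000`, slot 2 maps to `0x1400000` and slot 3 is the first slot of the second area, `0x4000000`.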
|
||||
|
||||
Virtual address randomization
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
After a random memory region has been selected by the kernel decompressor, new identity mapped pages will be built for this region on demand:
|
||||
|
||||
```C
|
||||
random_addr = find_random_phys_addr(min_addr, output_size);
|
||||
|
||||
if (*output != random_addr) {
|
||||
add_identity_map(random_addr, output_size);
|
||||
*output = random_addr;
|
||||
}
|
||||
```
|
||||
|
||||
From this time, `output` will store the base address of the memory region where the kernel will be decompressed. But at this moment, as you may remember, we have randomized only the physical address. The virtual address should be randomized too in the case of the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture:
|
||||
|
||||
```C
|
||||
if (IS_ENABLED(CONFIG_X86_64))
|
||||
random_addr = find_random_virt_addr(LOAD_PHYSICAL_ADDR, output_size);
|
||||
|
||||
*virt_addr = random_addr;
|
||||
```
|
||||
|
||||
As you may see, in the case of a non-`x86_64` architecture, the randomized virtual address will coincide with the randomized physical address. The `find_random_virt_addr` function calculates the number of virtual memory ranges that may hold the kernel image and calls the `kaslr_get_random_long` function that we already saw when we tried to find a random `physical` address.
|
||||
|
||||
From this moment we have both randomized base physical (`*output`) and virtual (`*virt_addr`) addresses for decompressed kernel.
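The body of `find_random_virt_addr` is not quoted in this part, so the sketch below is only one way such a computation could look. It assumes that the kernel image may be placed anywhere inside an area of `area_size` bytes, starting no lower than `minimum`, in steps of `align` bytes. All names are hypothetical and the random number is passed in as a parameter:

```C
#include <assert.h>
#include <stdint.h>

/* Number of aligned positions at which the image still fits in the area. */
static uint64_t demo_virt_slots(uint64_t area_size, uint64_t minimum,
                                uint64_t image_size, uint64_t align)
{
    assert(align != 0);
    assert(minimum + image_size <= area_size);
    return (area_size - minimum - image_size) / align + 1;
}

/* Picks one of the `slots` positions using the supplied random number. */
static uint64_t demo_virt_addr(uint64_t random, uint64_t slots,
                               uint64_t minimum, uint64_t align)
{
    assert(slots != 0);
    return (random % slots) * align + minimum;
}
```

For a 1 gigabyte area, a minimum of `0x1000000`, a 32 megabyte image and an alignment of `0x200000`, there are 489 possible slots, and a random value of 5 selects the address `0x1a00000`.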
|
||||
|
||||
That's all.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the sixth and last part about the Linux kernel booting process. We will not see posts about kernel booting anymore (maybe updates to this and previous posts), but there will be many posts about other kernel internals.
|
||||
|
||||
Next chapter will be about kernel initialization and we will see the first steps in the Linux kernel initialization code.
|
||||
|
||||
If you have any questions or suggestions write me a comment or ping me in [twitter](https://twitter.com/0xAX).
|
||||
|
||||
**Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-internals).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [Address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization)
|
||||
* [Linux kernel boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt)
|
||||
* [long mode](https://en.wikipedia.org/wiki/Long_mode)
|
||||
* [initrd](https://en.wikipedia.org/wiki/Initial_ramdisk)
|
||||
* [Enumerated type](https://en.wikipedia.org/wiki/Enumerated_type#C)
|
||||
* [four-level page tables](https://lwn.net/Articles/117749/)
|
||||
* [five-level page tables](https://lwn.net/Articles/717293/)
|
||||
* [EFI](https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface)
|
||||
* [e820](https://en.wikipedia.org/wiki/E820)
|
||||
* [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
|
||||
* [rdrand](https://en.wikipedia.org/wiki/RdRand)
|
||||
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
|
||||
* [Previous part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-5.md)
|
@ -2,4 +2,4 @@
|
||||
|
||||
This chapter describes `control groups` mechanism in the Linux kernel.
|
||||
|
||||
* [Introduction](cgroups1.md)
|
||||
* [Introduction](linux-cgroups-1.md)
|
||||
|
@ -4,7 +4,7 @@ Control Groups
|
||||
Introduction
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the first part of the new chapter of the [linux insides](http://0xax.gitbooks.io/linux-insides/content/) book and as you may guess by part's name - this part will cover [control groups](https://en.wikipedia.org/wiki/Cgroups) or `cgroups` mechanism in the Linux kernel.
|
||||
This is the first part of the new chapter of the [linux insides](https://0xax.gitbooks.io/linux-insides/content/) book and as you may guess by part's name - this part will cover [control groups](https://en.wikipedia.org/wiki/Cgroups) or `cgroups` mechanism in the Linux kernel.
|
||||
|
||||
`Cgroups` are a special mechanism provided by the Linux kernel which allows us to allocate `resources` like processor time, number of processes per group, amount of memory per control group, or a combination of such resources, for a process or set of processes. `Cgroups` are organized hierarchically, and in this respect the mechanism is similar to usual processes, as they are hierarchical too and child `cgroups` inherit a set of certain parameters from their parents. But actually they are not the same. The main difference between `cgroups` and normal processes is that many different hierarchies of control groups may exist simultaneously, while the normal process tree is always single. This was not a casual step because each control group hierarchy is attached to a set of control group `subsystems`.
|
||||
|
@ -2,7 +2,7 @@
|
||||
|
||||
This chapter describes various concepts which are used in the Linux kernel.
|
||||
|
||||
* [Per-CPU variables](per-cpu.md)
|
||||
* [CPU masks](cpumask.md)
|
||||
* [The initcall mechanism](initcall.md)
|
||||
* [Notification Chains](notification_chains.md)
|
||||
* [Per-CPU variables](linux-cpu-1.md)
|
||||
* [CPU masks](linux-cpu-2.md)
|
||||
* [The initcall mechanism](linux-cpu-3.md)
|
||||
* [Notification Chains](linux-cpu-4.md)
|
@ -92,7 +92,7 @@ Where the `percpu_alloc_setup` function sets the `pcpu_chosen_fc` variable depen
|
||||
enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO;
|
||||
```
|
||||
|
||||
If the `percpu_alloc` parameter is not given to the kernel command line, the `embed` allocator will be used which embeds the first percpu chunk into bootmem with the [memblock](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html). The last allocator is the first chunk `page` allocator which maps the first chunk with `PAGE_SIZE` pages.
|
||||
If the `percpu_alloc` parameter is not given to the kernel command line, the `embed` allocator will be used which embeds the first percpu chunk into bootmem with the [memblock](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-1.html). The last allocator is the first chunk `page` allocator which maps the first chunk with `PAGE_SIZE` pages.
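
The choice described above can be sketched in user space. This is only an illustration, not the kernel's `percpu_alloc_setup`: the helper name `choose_first_chunk_allocator` is hypothetical, and the sketch assumes that the parameter value is either `embed`, `page`, or absent.

```C
#include <stddef.h>
#include <string.h>

/* Mirrors the first chunk allocator choices named in the text. */
enum pcpu_fc {
	PCPU_FC_AUTO,   /* nothing requested: the architecture picks a default */
	PCPU_FC_EMBED,  /* embed the first chunk into memory from memblock */
	PCPU_FC_PAGE,   /* map the first chunk with PAGE_SIZE pages */
};

/*
 * Hypothetical helper: `arg` is the value given to `percpu_alloc=`,
 * or NULL when the parameter is absent from the command line.
 */
static enum pcpu_fc choose_first_chunk_allocator(const char *arg)
{
	if (arg == NULL)
		return PCPU_FC_AUTO;
	if (strcmp(arg, "embed") == 0)
		return PCPU_FC_EMBED;
	if (strcmp(arg, "page") == 0)
		return PCPU_FC_PAGE;
	return PCPU_FC_AUTO; /* unknown value: keep the default */
}
```

With this sketch, `choose_first_chunk_allocator("page")` yields `PCPU_FC_PAGE`, while a missing parameter leaves `PCPU_FC_AUTO`, which is the initial value of `pcpu_chosen_fc` shown above.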
|
||||
|
||||
As I wrote above, first of all we check the type of the first chunk allocator in `setup_per_cpu_areas`. We check that the first chunk allocator is not `page`:
|
||||
|
@ -10,7 +10,7 @@ Introduction
|
||||
* [lib/cpumask.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/lib/cpumask.c)
|
||||
* [kernel/cpu.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/cpu.c)
|
||||
|
||||
As comment says from the [include/linux/cpumask.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/cpumask.h): Cpumasks provide a bitmap suitable for representing the set of CPU's in a system, one bit position per CPU number. We already saw a bit about cpumask in the `boot_cpu_init` function from the [Kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) part. This function makes first boot cpu online, active and etc...:
|
||||
As comment says from the [include/linux/cpumask.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/cpumask.h): Cpumasks provide a bitmap suitable for representing the set of CPU's in a system, one bit position per CPU number. We already saw a bit about cpumask in the `boot_cpu_init` function from the [Kernel entry point](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) part. This function makes first boot cpu online, active and etc...:
|
||||
|
||||
```C
|
||||
set_cpu_online(cpu, true);
|
@ -213,7 +213,7 @@ If you are interested, you can find these sections in the `arch/x86/kernel/vmlin
|
||||
}
|
||||
```
|
||||
|
||||
If you are not familiar with this then you can know more about [linkers](https://en.wikipedia.org/wiki/Linker_%28computing%29) in the special [part](https://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html) of this book.
|
||||
If you are not familiar with this then you can know more about [linkers](https://en.wikipedia.org/wiki/Linker_%28computing%29) in the special [part](https://0xax.gitbooks.io/linux-insides/content/Misc/linux-misc-3.html) of this book.
|
||||
|
||||
As we just saw, the `do_initcall_level` function takes one parameter - the level of `initcall` - and does the following two things. First, it parses the `initcall_command_line`, which is a copy of the usual kernel [command line](https://www.kernel.org/doc/Documentation/kernel-parameters.txt) and may contain parameters for modules, with the `parse_args` function from the [kernel/params.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/params.c) source code file. Then it calls the `do_one_initcall` function for each `initcall` of the given level:
|
||||
|
||||
@ -387,9 +387,9 @@ Links
|
||||
* [symbols concatenation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html)
|
||||
* [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
|
||||
* [Link time optimization](https://gcc.gnu.org/wiki/LinkTimeOptimization)
|
||||
* [Introduction to linkers](https://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)
|
||||
* [Introduction to linkers](https://0xax.gitbooks.io/linux-insides/content/Misc/linux-misc-3.html)
|
||||
* [Linux kernel command line](https://www.kernel.org/doc/Documentation/kernel-parameters.txt)
|
||||
* [Process identifier](https://en.wikipedia.org/wiki/Process_identifier)
|
||||
* [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
|
||||
* [rootfs](https://en.wikipedia.org/wiki/Initramfs)
|
||||
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
|
||||
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html)
|
@ -76,7 +76,7 @@ In the first case for the `blocking notifier chains`, callbacks will be called/e
|
||||
|
||||
The second, `SRCU notifier chains`, represent an alternative form of `blocking notifier chains`. In the first case, blocking notifier chains use the `rw_semaphore` synchronization primitive to protect chain links. `SRCU` notifier chains run in process context too, but use a special form of the [RCU](https://en.wikipedia.org/wiki/Read-copy-update) mechanism in which it is permissible to block in a read-side critical section.
|
||||
|
||||
In the third case, the `atomic notifier chains` run in interrupt or atomic context and are protected by the [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) synchronization primitive. The last, `raw notifier chains`, provide a special type of notifier chain without any locking restrictions on callbacks. This means that protection rests on the shoulders of the caller. It is very useful when we want to protect our chain with a very specific locking mechanism.
|
||||
In the third case, the `atomic notifier chains` run in interrupt or atomic context and are protected by the [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) synchronization primitive. The last, `raw notifier chains`, provide a special type of notifier chain without any locking restrictions on callbacks. This means that protection rests on the shoulders of the caller. It is very useful when we want to protect our chain with a very specific locking mechanism.
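
To make the idea of a `raw` chain more concrete, here is a minimal user-space sketch of a chain without any locking. It is not the kernel implementation: the names `raw_chain_register`, `raw_chain_call` and `count_event` are hypothetical, the return value of callbacks is ignored, and any locking is left to the caller, as the text describes.

```C
#include <stddef.h>

struct notifier_block;

typedef int (*notifier_fn_t)(struct notifier_block *nb,
			     unsigned long action, void *data);

struct notifier_block {
	notifier_fn_t notifier_call;  /* callback to run on an event */
	struct notifier_block *next;  /* next element of the chain */
	int priority;                 /* higher priority runs first */
};

/* A raw head is just a pointer: no semaphore, no spinlock. */
struct raw_notifier_head {
	struct notifier_block *head;
};

/* Insert `n` so that the chain stays sorted by descending priority. */
static void raw_chain_register(struct raw_notifier_head *nh,
			       struct notifier_block *n)
{
	struct notifier_block **nl = &nh->head;

	while (*nl != NULL && (*nl)->priority >= n->priority)
		nl = &(*nl)->next;
	n->next = *nl;
	*nl = n;
}

/* Run every callback in order; returns how many were called. */
static int raw_chain_call(struct raw_notifier_head *nh,
			  unsigned long action, void *data)
{
	int calls = 0;

	for (struct notifier_block *nb = nh->head; nb != NULL; nb = nb->next) {
		nb->notifier_call(nb, action, data);
		calls++;
	}
	return calls;
}

/* Example callback: counts events through the `data` pointer. */
static int count_event(struct notifier_block *nb,
		       unsigned long action, void *data)
{
	(void)nb;
	(void)action;
	++*(int *)data;
	return 0;
}
```

A caller that needs protection would wrap `raw_chain_register` and `raw_chain_call` in whatever lock suits it, which is exactly the freedom that a raw chain leaves open.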
|
||||
|
||||
If we look at the implementation of the `notifier_block` structure, we will see that it contains a pointer to the `next` element of a notification chain list, but there is no head. The head of such a list lives in a separate structure that depends on the type of the notification chain. For example, for the `blocking notifier chains`:
|
||||
|
||||
@ -118,7 +118,7 @@ which defines head for loadable modules blocking notifier chain. The `BLOCKING_N
|
||||
} while (0)
|
||||
```
|
||||
|
||||
So we may see that it takes the name of the head of a blocking notifier chain, initializes the read/write [semaphore](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html) and sets the head to `NULL`. Besides the `BLOCKING_INIT_NOTIFIER_HEAD` macro, the Linux kernel additionally provides the `ATOMIC_INIT_NOTIFIER_HEAD` and `RAW_INIT_NOTIFIER_HEAD` macros and the `srcu_init_notifier_head` function for initialization of atomic and other types of notification chains.
|
||||
So we may see that it takes the name of the head of a blocking notifier chain, initializes the read/write [semaphore](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-3.html) and sets the head to `NULL`. Besides the `BLOCKING_INIT_NOTIFIER_HEAD` macro, the Linux kernel additionally provides the `ATOMIC_INIT_NOTIFIER_HEAD` and `RAW_INIT_NOTIFIER_HEAD` macros and the `srcu_init_notifier_head` function for initialization of atomic and other types of notification chains.
|
||||
|
||||
After initialization of the head of a notification chain, a subsystem which wants to receive notifications from the given chain should register with a certain function which depends on the type of notification chain. If you look in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) header file, you will see the following four functions for this:
|
||||
|
||||
@ -331,7 +331,7 @@ static struct notifier_block tracepoint_module_nb = {
|
||||
};
|
||||
```
|
||||
|
||||
The callback of this notifier block is called when one of the `MODULE_STATE_LIVE`, `MODULE_STATE_COMING` or `MODULE_STATE_GOING` events occurs. For example, the `MODULE_STATE_LIVE` and `MODULE_STATE_COMING` notifications will be sent during execution of the [init_module](http://man7.org/linux/man-pages/man2/init_module.2.html) [system call](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html). Or, for example, `MODULE_STATE_GOING` will be sent during execution of the [delete_module](http://man7.org/linux/man-pages/man2/delete_module.2.html) `system call`:
|
||||
The callback of this notifier block is called when one of the `MODULE_STATE_LIVE`, `MODULE_STATE_COMING` or `MODULE_STATE_GOING` events occurs. For example, the `MODULE_STATE_LIVE` and `MODULE_STATE_COMING` notifications will be sent during execution of the [init_module](http://man7.org/linux/man-pages/man2/init_module.2.html) [system call](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-1.html). Or, for example, `MODULE_STATE_GOING` will be sent during execution of the [delete_module](http://man7.org/linux/man-pages/man2/delete_module.2.html) `system call`:
|
||||
|
||||
```C
|
||||
SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
|
||||
@ -359,11 +359,11 @@ Links
|
||||
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
|
||||
* [callback](https://en.wikipedia.org/wiki/Callback_(computer_programming))
|
||||
* [RCU](https://en.wikipedia.org/wiki/Read-copy-update)
|
||||
* [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html)
|
||||
* [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html)
|
||||
* [loadable modules](https://en.wikipedia.org/wiki/Loadable_kernel_module)
|
||||
* [semaphore](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html)
|
||||
* [semaphore](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-3.html)
|
||||
* [tracepoints](https://www.kernel.org/doc/Documentation/trace/tracepoints.txt)
|
||||
* [system call](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html)
|
||||
* [system call](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-1.html)
|
||||
* [init_module system call](http://man7.org/linux/man-pages/man2/init_module.2.html)
|
||||
* [delete_module](http://man7.org/linux/man-pages/man2/delete_module.2.html)
|
||||
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Concepts/initcall.html)
|
||||
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-3.html)
|
@ -5,6 +5,6 @@ Linux kernel provides different implementations of data structures like doubly l
|
||||
|
||||
This part considers the following data structures and algorithms:
|
||||
|
||||
* [Doubly linked list](dlist.md)
|
||||
* [Radix tree](radix-tree.md)
|
||||
* [Bit arrays](bitmap.md)
|
||||
* [Doubly linked list](linux-datastructures-1.md)
|
||||
* [Radix tree](linux-datastructures-2.md)
|
||||
* [Bit arrays](linux-datastructures-3.md)
|
||||
|
@ -13,7 +13,7 @@ Besides these two files, there is also architecture-specific header file which p
|
||||
|
||||
* [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/bitops.h)
|
||||
|
||||
header file. As I just wrote above, the `bitmap` is heavily used in the Linux kernel. For example a `bit array` is used to store set of online/offline processors for systems which support [hot-plug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) cpu (more about this you can read in the [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) part), a `bit array` stores set of allocated [irqs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) during initialization of the Linux kernel and etc.
|
||||
header file. As I just wrote above, the `bitmap` is heavily used in the Linux kernel. For example a `bit array` is used to store set of online/offline processors for systems which support [hot-plug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) cpu (more about this you can read in the [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) part), a `bit array` stores set of allocated [irqs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) during initialization of the Linux kernel and etc.
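
As a rough illustration of the cpumask use case mentioned above, a bit array can be sketched in user space as an array of `unsigned long` words with one bit per CPU. The names `cpu_bitmap`, `bitmap_set_bit`, `bitmap_clear_bit` and `bitmap_test_bit` and the `NR_CPUS` value are assumptions for this sketch, and the operations are not atomic, unlike the architecture-specific kernel ones.

```C
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_LONG      (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(nr)  (((nr) + BITS_PER_LONG - 1) / BITS_PER_LONG)
#define NR_CPUS            128 /* hypothetical number of possible CPUs */

/* One bit per CPU number, packed into unsigned long words. */
struct cpu_bitmap {
	unsigned long bits[BITS_TO_LONGS(NR_CPUS)];
};

/* Mark CPU `nr` as present in the set (not atomic). */
static void bitmap_set_bit(struct cpu_bitmap *map, size_t nr)
{
	map->bits[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

/* Remove CPU `nr` from the set (not atomic). */
static void bitmap_clear_bit(struct cpu_bitmap *map, size_t nr)
{
	map->bits[nr / BITS_PER_LONG] &= ~(1UL << (nr % BITS_PER_LONG));
}

/* Check whether CPU `nr` is in the set. */
static bool bitmap_test_bit(const struct cpu_bitmap *map, size_t nr)
{
	return (map->bits[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}
```

Bringing a processor online would then correspond to `bitmap_set_bit(&online, cpu)` and taking it offline to `bitmap_clear_bit(&online, cpu)`, where `online` is a zero-initialized `struct cpu_bitmap`.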
|
||||
|
||||
So, the main goal of this part is to see how `bit arrays` are implemented in the Linux kernel. Let's start.
|
||||
|
||||
@ -365,7 +365,7 @@ Links
|
||||
* [linked data structures](https://en.wikipedia.org/wiki/Linked_data_structure)
|
||||
* [tree data structures](https://en.wikipedia.org/wiki/Tree_%28data_structure%29)
|
||||
* [hot-plug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
|
||||
* [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
|
||||
* [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html)
|
||||
* [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
|
||||
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
|
||||
* [atomic operations](https://en.wikipedia.org/wiki/Linearizability)
|
@ -4,9 +4,9 @@ Kernel initialization. Part 1.
|
||||
First steps in the kernel code
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The previous [post](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) was a last part of the Linux kernel [booting process](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html) chapter and now we are starting to dive into initialization process of the Linux kernel. After the image of the Linux kernel is decompressed and placed in a correct place in memory, it starts to work. All previous parts describe the work of the Linux kernel setup code which does preparation before the first bytes of the Linux kernel code will be executed. From now we are in the kernel and all parts of this chapter will be devoted to the initialization process of the kernel before it will launch process with [pid](https://en.wikipedia.org/wiki/Process_identifier) `1`. There are many things to do before the kernel will start first `init` process. Hope we will see all of the preparations before kernel will start in this big chapter. We will start from the kernel entry point, which is located in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head_64.S) and will move further and further. We will see first preparations like early page tables initialization, switch to a new descriptor in kernel space and many many more, before we will see the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c#L489) will be called.
|
||||
The previous [post](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-6.html) was a last part of the Linux kernel [booting process](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html) chapter and now we are starting to dive into initialization process of the Linux kernel. After the image of the Linux kernel is decompressed and placed in a correct place in memory, it starts to work. All previous parts describe the work of the Linux kernel setup code which does preparation before the first bytes of the Linux kernel code will be executed. From now we are in the kernel and all parts of this chapter will be devoted to the initialization process of the kernel before it will launch process with [pid](https://en.wikipedia.org/wiki/Process_identifier) `1`. There are many things to do before the kernel will start first `init` process. Hope we will see all of the preparations before kernel will start in this big chapter. We will start from the kernel entry point, which is located in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head_64.S) and will move further and further. We will see first preparations like early page tables initialization, switch to a new descriptor in kernel space and many many more, before we will see the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c#L489) will be called.
|
||||
|
||||
In the last [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) of the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html) we stopped at the [jmp](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S) instruction from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S) assembly source code file:
|
||||
In the last [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-6.html) of the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html) we stopped at the [jmp](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S) instruction from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S) assembly source code file:
|
||||
|
||||
```assembly
|
||||
jmp *%rax
|
||||
@ -88,7 +88,7 @@ After we got the address of the `startup_64`, we need to do a check that this ad
|
||||
jnz bad_address
|
||||
```
|
||||
|
||||
Here we just compare low part of the `rbp` register with the complemented value of the `PMD_PAGE_MASK`. The `PMD_PAGE_MASK` indicates the mask for `Page middle directory` (read [paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html) about it) and defined as:
|
||||
Here we just compare low part of the `rbp` register with the complemented value of the `PMD_PAGE_MASK`. The `PMD_PAGE_MASK` indicates the mask for `Page middle directory` (read [Paging](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html) about it) and defined as:
|
||||
|
||||
```C
|
||||
#define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1))
|
||||
@ -163,7 +163,7 @@ Looks hard, but it isn't. First of all let's look at the `early_level4_pgt`. It
|
||||
_PAGE_ACCESSED | _PAGE_DIRTY)
|
||||
```
|
||||
|
||||
You can read more about it in the [paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html) part.
|
||||
You can read more about it in the [Paging](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html) part.
|
||||
|
||||
The `level3_kernel_pgt` stores two entries which map kernel space. At the start of its definition, we can see that it is filled with zeros `L3_START_KERNEL` or `510` times. Here `L3_START_KERNEL` is the index in the page upper directory which contains the `__START_KERNEL_map` address, and it equals `510`. After this, we can see the definition of the two `level3_kernel_pgt` entries: `level2_kernel_pgt` and `level2_fixmap_pgt`. The first is simple: it is a page table entry which contains a pointer to the page middle directory which maps kernel space, and it has:
|
||||
|
||||
@ -485,7 +485,7 @@ INIT_PER_CPU(gdt_page);
|
||||
|
||||
As we got `init_per_cpu__gdt_page` from `INIT_PER_CPU_VAR`, and the `INIT_PER_CPU` macro from the linker script will be expanded, we will get the offset from `__per_cpu_load`. After these calculations, we will have the correct base address of the new GDT.
|
||||
|
||||
Generally, per-CPU variables are a 2.6 kernel feature. You can understand what they are from the name. When we create a `per-CPU` variable, each CPU will have its own copy of this variable. Here we are creating the `gdt_page` per-CPU variable. There are many advantages to variables of this type, for example no locks are needed because each CPU works with its own copy of the variable. So every core on a multiprocessor will have its own `GDT` table, and every entry in the table will represent a memory segment which can be accessed from the thread which runs on that core. You can read in detail about `per-CPU` variables in the [Theory/per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) post.
|
||||
Generally, per-CPU variables are a 2.6 kernel feature. You can understand what they are from the name. When we create a `per-CPU` variable, each CPU will have its own copy of this variable. Here we are creating the `gdt_page` per-CPU variable. There are many advantages to variables of this type, for example no locks are needed because each CPU works with its own copy of the variable. So every core on a multiprocessor will have its own `GDT` table, and every entry in the table will represent a memory segment which can be accessed from the thread which runs on that core. You can read in detail about `per-CPU` variables in the [Theory/per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) post.
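
The "one copy per CPU" idea can be sketched in user space with a plain array indexed by CPU number. This is only an analogy: the kernel places per-CPU data in a dedicated section and reaches each copy through an offset, while the names `percpu_counter_demo`, `demo_cpu_inc` and `demo_sum` and the `NR_CPUS` value below are hypothetical.

```C
#include <stddef.h>

#define NR_CPUS 4 /* hypothetical number of CPUs for the sketch */

/* Each CPU owns exactly one slot, so slots never need a shared lock. */
struct percpu_counter_demo {
	unsigned long value[NR_CPUS];
};

/* Code running on `cpu` touches only its own copy. */
static void demo_cpu_inc(struct percpu_counter_demo *c, unsigned int cpu)
{
	c->value[cpu]++;
}

/* A reader that wants a global view has to walk all copies. */
static unsigned long demo_sum(const struct percpu_counter_demo *c)
{
	unsigned long sum = 0;

	for (size_t cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->value[cpu];
	return sum;
}
```

The same reasoning applies to `gdt_page`: since each CPU works with its own table, loading or updating one copy does not have to be synchronized with the other processors.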
|
||||
|
||||
As we loaded new Global Descriptor Table, we reload segments as we did it every time:
|
||||
|
||||
@ -614,7 +614,7 @@ Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [Model Specific Register](http://en.wikipedia.org/wiki/Model-specific_register)
|
||||
* [Paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)
|
||||
* [Previous part - Kernel decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html)
|
||||
* [Paging](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html)
|
||||
* [Previous part - kernel load address randomization](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-6.html)
|
||||
* [NX](http://en.wikipedia.org/wiki/NX_bit)
|
||||
* [ASLR](http://en.wikipedia.org/wiki/Address_space_layout_randomization)
|
||||
|
@ -4,7 +4,7 @@ Kernel initialization. Part 10.
|
||||
End of the linux kernel initialization process
|
||||
================================================================================
|
||||
|
||||
This is tenth part of the chapter about linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and in the [previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html) we saw the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update) and stopped on the call of the `acpi_early_init` function. This part will be the last part of the [Kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) chapter, so let's finish it.
|
||||
This is tenth part of the chapter about linux kernel [initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and in the [previous part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html) we saw the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update) and stopped on the call of the `acpi_early_init` function. This part will be the last part of the [Kernel initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) chapter, so let's finish it.
|
||||
|
||||
After the call of the `acpi_early_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c), we can see the following code:
|
||||
|
||||
@ -185,7 +185,7 @@ nrpages = (nr_free_buffer_pages() * 10) / 100;
|
||||
max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
|
||||
```
|
||||
|
||||
which will be equal to the `10%` of the `ZONE_NORMAL` (all RAM from the 4GB on the `x86_64`). The next function after the `buffer_init` is - `vfs_caches_init`. This function allocates `SLAB` caches and hashtable for different [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) caches. We already saw the `vfs_caches_init_early` function in the eighth part of the linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html) which initialized caches for `dcache` (or directory-cache) and [inode](http://en.wikipedia.org/wiki/Inode) cache. The `vfs_caches_init` function makes post-early initialization of the `dcache` and `inode` caches, private data cache, hash tables for the mount points, etc. More details about [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) will be described in the separate part. After this we can see `signals_init` function. This function is defined in the [kernel/signal.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/signal.c) and allocates a cache for the `sigqueue` structures which represents queue of the real time signals. The next function is `page_writeback_init`. This function initializes the ratio for the dirty pages. Every low-level page entry contains the `dirty` bit which indicates whether a page has been written to after been loaded into memory.
|
||||
which will be equal to the `10%` of the `ZONE_NORMAL` (all RAM from the 4GB on the `x86_64`). The next function after the `buffer_init` is - `vfs_caches_init`. This function allocates `SLAB` caches and hashtable for different [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) caches. We already saw the `vfs_caches_init_early` function in the eighth part of the linux kernel [initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html) which initialized caches for `dcache` (or directory-cache) and [inode](http://en.wikipedia.org/wiki/Inode) cache. The `vfs_caches_init` function makes post-early initialization of the `dcache` and `inode` caches, private data cache, hash tables for the mount points, etc. More details about [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) will be described in the separate part. After this we can see `signals_init` function. This function is defined in the [kernel/signal.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/signal.c) and allocates a cache for the `sigqueue` structures which represents queue of the real time signals. The next function is `page_writeback_init`. This function initializes the ratio for the dirty pages. Every low-level page entry contains the `dirty` bit which indicates whether a page has been written to after been loaded into memory.
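
The `max_buffer_heads` arithmetic at the start of the paragraph above can be checked with a small user-space sketch. The function name `max_buffer_heads_for` is hypothetical, and the sizes are passed as parameters because the real size of `struct buffer_head` depends on the kernel configuration.

```C
/*
 * Same arithmetic as in `buffer_init`: take 10% of the free buffer
 * pages and count how many buffer heads fit into that many pages.
 */
static unsigned long max_buffer_heads_for(unsigned long nr_free_pages,
					  unsigned long page_size,
					  unsigned long buffer_head_size)
{
	unsigned long nrpages = (nr_free_pages * 10) / 100;

	return nrpages * (page_size / buffer_head_size);
}
```

For example, with 1000000 free pages, a page size of 4096 bytes and an assumed 104-byte `struct buffer_head`, `nrpages` is 100000, each page holds 39 whole buffer heads, and the limit comes out as 3900000.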
|
||||
|
||||
Creation of the root for the procfs
|
||||
--------------------------------------------------------------------------------
|
||||
@ -440,7 +440,7 @@ That's all! Linux kernel initialization process is finished!
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
It is the end of the tenth part about the linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). It is not only the `tenth` part, but it is also the last part which describes initialization of the linux kernel. As I wrote in the first [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of this chapter, we will go through all steps of the kernel initialization and we did it. We started at the first architecture-independent function - `start_kernel` and finished with the launch of the first `init` process in our system. I skipped details about different subsystems of the kernel, for example I almost did not cover the scheduler, interrupts, exception handling, etc. From the next part we will start to dive into the different kernel subsystems. Hope it will be interesting.
|
||||
It is the end of the tenth part about the linux kernel [initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). It is not only the `tenth` part, but it is also the last part which describes initialization of the linux kernel. As I wrote in the first [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of this chapter, we will go through all steps of the kernel initialization and we did it. We started at the first architecture-independent function - `start_kernel` and finished with the launch of the first `init` process in our system. I skipped details about different subsystems of the kernel, for example I almost did not cover the scheduler, interrupts, exception handling, etc. From the next part we will start to dive into the different kernel subsystems. Hope it will be interesting.
|
||||
|
||||
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
||||
|
||||
@ -470,4 +470,4 @@ Links
|
||||
* [Tmpfs](http://en.wikipedia.org/wiki/Tmpfs)
|
||||
* [initrd](http://en.wikipedia.org/wiki/Initrd)
|
||||
* [panic](http://en.wikipedia.org/wiki/Kernel_panic)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html)
|
||||
|
@ -4,9 +4,9 @@ Kernel initialization. Part 2.
|
||||
Early interrupt and exception handling
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) we stopped before setting of early interrupt handlers. At this moment we are in the decompressed Linux kernel, we have basic [paging](https://en.wikipedia.org/wiki/Page_table) structure for early boot and our current goal is to finish early preparation before the main kernel code will start to work.
|
||||
In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) we stopped before setting of early interrupt handlers. At this moment we are in the decompressed Linux kernel, we have basic [paging](https://en.wikipedia.org/wiki/Page_table) structure for early boot and our current goal is to finish early preparation before the main kernel code will start to work.
|
||||
|
||||
We already started to do this preparation in the previous [first](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) part of this [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). We continue in this part and will know more about interrupt and exception handling.
|
||||
We already started to do this preparation in the previous [first](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) part of this [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). We continue in this part and will know more about interrupt and exception handling.
|
||||
|
||||
Remember that we stopped before following loop:
|
||||
|
||||
@ -492,4 +492,4 @@ Links
|
||||
* [Page table](https://en.wikipedia.org/wiki/Page_table)
|
||||
* [Interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler)
|
||||
* [Page Fault](https://en.wikipedia.org/wiki/Page_fault)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html)
@ -76,7 +76,7 @@ After microcode was loaded we can see the check of the `console_loglevel` and th
Move on init pages
--------------------------------------------------------------------------------
In the next step, as we have copied `boot_params` structure, we need to move from the early page tables to the page tables for initialization process. We already set early page tables for switchover, you can read about it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) and dropped all it in the `reset_early_page_tables` function (you can read about it in the previous part too) and kept only kernel high mapping. After this we call:
In the next step, as we have copied `boot_params` structure, we need to move from the early page tables to the page tables for initialization process. We already set early page tables for switchover, you can read about it in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) and dropped all it in the `reset_early_page_tables` function (you can read about it in the previous part too) and kept only kernel high mapping. After this we call:
```C
clear_page(init_level4_pgt);
@ -241,7 +241,7 @@ For now it is just zero. If the `CONFIG_DEBUG_PREEMPT` configuration option is d
#define raw_smp_processor_id() (this_cpu_read(cpu_number))
```
`this_cpu_read` as many other function like this (`this_cpu_write`, `this_cpu_add` and etc...) defined in the [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/percpu-defs.h) and presents `this_cpu` operation. These operations provide a way of optimizing access to the [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Theory/per-cpu.html) variables which are associated with the current processor. In our case it is `this_cpu_read`:
`this_cpu_read` as many other function like this (`this_cpu_write`, `this_cpu_add` and etc...) defined in the [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/percpu-defs.h) and presents `this_cpu` operation. These operations provide a way of optimizing access to the [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variables which are associated with the current processor. In our case it is `this_cpu_read`:
```
__pcpu_size_call_return(this_cpu_read_, pcp)
@ -346,7 +346,7 @@ static inline int __check_is_bitmap(const unsigned long *bitmap)
Yeah, it just returns `1` every time. Actually we need in it here only for one purpose: at compile time it checks that the given `bitmap` is a bitmap, or in other words it checks that the given `bitmap` has a type of `unsigned long *`. So we just pass `cpu_possible_bits` to the `to_cpumask` macro for converting the array of `unsigned long` to the `struct cpumask *`. Now we can call `cpumask_set_cpu` function with the `cpu` - 0 and `struct cpumask *cpu_possible_bits`. This function makes only one call of the `set_bit` function which sets the given `cpu` in the cpumask. All of these `set_cpu_*` functions work on the same principle.
If you're not sure that this `set_cpu_*` operations and `cpumask` are not clear for you, don't worry about it. You can get more info by reading the special part about it - [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) or [documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt).
If you're not sure that this `set_cpu_*` operations and `cpumask` are not clear for you, don't worry about it. You can get more info by reading the special part about it - [cpumask](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) or [documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt).
As we activated the bootstrap processor, it's time to go to the next function in the `start_kernel.` Now it is `page_address_init`, but this function does nothing in our case, because it executes only when all `RAM` can't be mapped directly.
@ -383,7 +383,7 @@ This function starts from the reserving memory block for the kernel `_text` and
memblock_reserve(__pa_symbol(_text), (unsigned long)__bss_stop - (unsigned long)_text);
```
You can read about `memblock` in the [Linux kernel memory management Part 1.](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html). As you can remember `memblock_reserve` function takes two parameters:
You can read about `memblock` in the [Linux kernel memory management Part 1.](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-1.html). As you can remember `memblock_reserve` function takes two parameters:
* base physical address of a memory block;
* size of a memory block.
@ -415,7 +415,7 @@ u64 ramdisk_size = get_ramdisk_size();
u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);
```
All of these parameters are taken from `boot_params`. If you have read the chapter about [Linux Kernel Booting Process](http://0xax.gitbooks.io/linux-insides/content/Booting/index.html), you must remember that we filled the `boot_params` structure during boot time. The kernel setup header contains a couple of fields which describes ramdisk, for example:
All of these parameters are taken from `boot_params`. If you have read the chapter about [Linux Kernel Booting Process](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html), you must remember that we filled the `boot_params` structure during boot time. The kernel setup header contains a couple of fields which describes ramdisk, for example:
```
Field name: ramdisk_image
@ -4,7 +4,7 @@ Kernel initialization. Part 5.
Continue of architecture-specific initialization
================================================================================
In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html), we stopped at the initialization of an architecture-specific stuff from the [setup_arch](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c#L856) function and now we will continue with it. As we reserved memory for the [initrd](http://en.wikipedia.org/wiki/Initrd), next step is the `olpc_ofw_detect` which detects [One Laptop Per Child support](http://wiki.laptop.org/go/OFW_FAQ). We will not consider platform related stuff in this book and will skip functions related with it. So let's go ahead. The next step is the `early_trap_init` function. This function initializes debug (`#DB` - raised when the `TF` flag of rflags is set) and `int3` (`#BP`) interrupts gate. If you don't know anything about interrupts, you can read about it in the [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html). In `x86` architecture `INT`, `INTO` and `INT3` are special instructions which allow a task to explicitly call an interrupt handler. The `INT3` instruction calls the breakpoint (`#BP`) handler. You may remember, we already saw it in the [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) about interrupts: and exceptions:
In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html), we stopped at the initialization of an architecture-specific stuff from the [setup_arch](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c#L856) function and now we will continue with it. As we reserved memory for the [initrd](http://en.wikipedia.org/wiki/Initrd), next step is the `olpc_ofw_detect` which detects [One Laptop Per Child support](http://wiki.laptop.org/go/OFW_FAQ). We will not consider platform related stuff in this book and will skip functions related with it. So let's go ahead. The next step is the `early_trap_init` function. This function initializes debug (`#DB` - raised when the `TF` flag of rflags is set) and `int3` (`#BP`) interrupts gate. If you don't know anything about interrupts, you can read about it in the [Early interrupt and exception handling](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html). In `x86` architecture `INT`, `INTO` and `INT3` are special instructions which allow a task to explicitly call an interrupt handler. The `INT3` instruction calls the breakpoint (`#BP`) handler. You may remember, we already saw it in the [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) about interrupts: and exceptions:
```
----------------------------------------------------------------------------------------------
@ -163,7 +163,7 @@ The next step is initialization of early `ioremap`. In general there are two way
* I/O Ports;
* Device memory.
We already saw first method (`outb/inb` instructions) in the part about linux kernel booting [process](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html). The second method is to map I/O physical addresses to virtual addresses. When a physical address is accessed by the CPU, it may refer to a portion of physical RAM which can be mapped on memory of the I/O device. So `ioremap` used to map device memory into kernel address space.
We already saw first method (`outb/inb` instructions) in the part about linux kernel booting [process](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html). The second method is to map I/O physical addresses to virtual addresses. When a physical address is accessed by the CPU, it may refer to a portion of physical RAM which can be mapped on memory of the I/O device. So `ioremap` used to map device memory into kernel address space.
As i wrote above next function is the `early_ioremap_init` which re-maps I/O memory to kernel address space so it can access it. We need to initialize early ioremap for early initialization code which needs to temporarily map I/O or memory regions before the normal mapping functions like `ioremap` are available. Implementation of this function is in the [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/mm/ioremap.c). At the start of the `early_ioremap_init` we can see definition of the `pmd` point with `pmd_t` type (which presents page middle directory entry `typedef struct { pmdval_t pmd; } pmd_t;` where `pmdval_t` is `unsigned long`) and make a check that `fixmap` aligned in a correct way:
@ -187,7 +187,7 @@ memset(bm_pte, 0, sizeof(bm_pte));
pmd_populate_kernel(&init_mm, pmd, bm_pte);
```
That's all for this. If you feeling puzzled, don't worry. There is special part about `ioremap` and `fixmaps` in the [Linux Kernel Memory Management. Part 2](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md) chapter.
That's all for this. If you feeling puzzled, don't worry. There is special part about `ioremap` and `fixmaps` in the [Linux Kernel Memory Management. Part 2](https://github.com/0xAX/linux-insides/blob/master/MM/linux-mm-2.md) chapter.
Obtaining major and minor numbers for the root device
--------------------------------------------------------------------------------
@ -235,7 +235,7 @@ After calculation we will get `0xfff` or 12 bits for `major` if it is `0xfffffff
Memory map setup
--------------------------------------------------------------------------------
The next point is the setup of the memory map with the call of the `setup_memory_map` function. But before this we setup different parameters as information about a screen (current row and column, video page and etc... (you can read about it in the [Video mode initialization and transition to protected mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html))), Extended display identification data, video mode, bootloader_type and etc...:
The next point is the setup of the memory map with the call of the `setup_memory_map` function. But before this we setup different parameters as information about a screen (current row and column, video page and etc... (you can read about it in the [Video mode initialization and transition to protected mode](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html))), Extended display identification data, video mode, bootloader_type and etc...:
```C
screen_info = boot_params.screen_info;
@ -354,7 +354,7 @@ struct x86_init_ops x86_init __initdata = {
}
```
As we can see here `memry_setup` field is `default_machine_specific_memory_setup` where we get the number of the [e820](http://en.wikipedia.org/wiki/E820) entries which we collected in the [boot time](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html), sanitize the BIOS e820 map and fill `e820map` structure with the memory regions. As all regions are collected, print of all regions with printk. You can find this print if you execute `dmesg` command and you can see something like this:
As we can see here `memry_setup` field is `default_machine_specific_memory_setup` where we get the number of the [e820](http://en.wikipedia.org/wiki/E820) entries which we collected in the [boot time](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html), sanitize the BIOS e820 map and fill `e820map` structure with the memory regions. As all regions are collected, print of all regions with printk. You can find this print if you execute `dmesg` command and you can see something like this:
```
[ 0.000000] e820: BIOS-provided physical RAM map:
@ -408,7 +408,7 @@ static inline void __init copy_edd(void)
Memory descriptor initialization
--------------------------------------------------------------------------------
The next step is initialization of the memory descriptor of the init process. As you already can know every process has its own address space. This address space presented with special data structure which called `memory descriptor`. Directly in the linux kernel source code memory descriptor presented with `mm_struct` structure. `mm_struct` contains many different fields related with the process address space as start/end address of the kernel code/data, start/end of the brk, number of memory areas, list of memory areas and etc... This structure defined in the [include/linux/mm_types.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/mm_types.h). As every process has its own memory descriptor, `task_struct` structure contains it in the `mm` and `active_mm` field. And our first `init` process has it too. You can remember that we saw the part of initialization of the init `task_struct` with `INIT_TASK` macro in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html):
The next step is initialization of the memory descriptor of the init process. As you already can know every process has its own address space. This address space presented with special data structure which called `memory descriptor`. Directly in the linux kernel source code memory descriptor presented with `mm_struct` structure. `mm_struct` contains many different fields related with the process address space as start/end address of the kernel code/data, start/end of the brk, number of memory areas, list of memory areas and etc... This structure defined in the [include/linux/mm_types.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/mm_types.h). As every process has its own memory descriptor, `task_struct` structure contains it in the `mm` and `active_mm` field. And our first `init` process has it too. You can remember that we saw the part of initialization of the init `task_struct` with `INIT_TASK` macro in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html):
```C
#define INIT_TASK(tsk) \
@ -492,7 +492,7 @@ void x86_configure_nx(void)
Conclusion
--------------------------------------------------------------------------------
It is the end of the fifth part about linux kernel initialization process. In this part we continued to dive in the `setup_arch` function which makes initialization of architecture-specific stuff. It was long part, but we have not finished with it. As i already wrote, the `setup_arch` is big function, and I am really not sure that we will cover all of it even in the next part. There were some new interesting concepts in this part like `Fix-mapped` addresses, ioremap and etc... Don't worry if they are unclear for you. There is a special part about these concepts - [Linux kernel memory management Part 2.](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md). In the next part we will continue with the initialization of the architecture-specific stuff and will see parsing of the early kernel parameters, early dump of the pci devices, direct Media Interface scanning and many many more.
It is the end of the fifth part about linux kernel initialization process. In this part we continued to dive in the `setup_arch` function which makes initialization of architecture-specific stuff. It was long part, but we have not finished with it. As i already wrote, the `setup_arch` is big function, and I am really not sure that we will cover all of it even in the next part. There were some new interesting concepts in this part like `Fix-mapped` addresses, ioremap and etc... Don't worry if they are unclear for you. There is a special part about these concepts - [Linux kernel memory management Part 2.](https://github.com/0xAX/linux-insides/blob/master/MM/linux-mm-2.md). In the next part we will continue with the initialization of the architecture-specific stuff and will see parsing of the early kernel parameters, early dump of the pci devices, `Desktop Management Interface` scanning and many many more.
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
@ -511,4 +511,4 @@ Links
* [CFI directives](https://sourceware.org/binutils/docs/as/CFI-directives.html)
* [PDF. dwarf4 specification](http://dwarfstd.org/doc/DWARF4.pdf)
* [Call stack](http://en.wikipedia.org/wiki/Call_stack)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html)
@ -4,7 +4,7 @@ Kernel initialization. Part 6.
Architecture-specific initialization, again...
================================================================================
In the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) we saw architecture-specific (`x86_64` in our case) initialization stuff from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c) and finished on `x86_configure_nx` function which sets the `_PAGE_NX` flag depends on support of [NX bit](http://en.wikipedia.org/wiki/NX_bit). As I wrote before `setup_arch` function and `start_kernel` are very big, so in this and in the next part we will continue to learn about architecture-specific initialization process. The next function after `x86_configure_nx` is `parse_early_param`. This function is defined in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) and as you can understand from its name, this function parses kernel command line and setups different services depends on the given parameters (all kernel command line parameters you can find are in the [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/kernel-parameters.txt)). You may remember how we setup `earlyprintk` in the earliest [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html). On the early stage we looked for kernel parameters and their value with the `cmdline_find_option` function and `__cmdline_find_option`, `__cmdline_find_option_bool` helpers from the [arch/x86/boot/cmdline.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/cmdline.c). There we're in the generic kernel part which does not depend on architecture and here we use another approach. If you are reading linux kernel source code, you already note calls like this:
In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) we saw architecture-specific (`x86_64` in our case) initialization stuff from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c) and finished on `x86_configure_nx` function which sets the `_PAGE_NX` flag depends on support of [NX bit](http://en.wikipedia.org/wiki/NX_bit). As I wrote before `setup_arch` function and `start_kernel` are very big, so in this and in the next part we will continue to learn about architecture-specific initialization process. The next function after `x86_configure_nx` is `parse_early_param`. This function is defined in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) and as you can understand from its name, this function parses kernel command line and setups different services depends on the given parameters (all kernel command line parameters you can find are in the [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/kernel-parameters.txt)). You may remember how we setup `earlyprintk` in the earliest [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html). On the early stage we looked for kernel parameters and their value with the `cmdline_find_option` function and `__cmdline_find_option`, `__cmdline_find_option_bool` helpers from the [arch/x86/boot/cmdline.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/cmdline.c). There we're in the generic kernel part which does not depend on architecture and here we use another approach. If you are reading linux kernel source code, you already note calls like this:
```C
early_param("gbpages", parse_direct_gbpages_on);
@ -97,7 +97,7 @@ After this we can see call of the:
memblock_x86_reserve_range_setup_data();
```
function. This function is defined in the same [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c) source code file and remaps memory for the `setup_data` and reserved memory block for the `setup_data` (more about `setup_data` you can read in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) and about `ioremap` and `memblock` you can read in the [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)).
function. This function is defined in the same [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c) source code file and remaps memory for the `setup_data` and reserved memory block for the `setup_data` (more about `setup_data` you can read in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) and about `ioremap` and `memblock` you can read in the [Linux kernel memory management](https://0xax.gitbooks.io/linux-insides/content/MM/index.html)).
In the next step we can see following conditional statement:
@ -128,7 +128,7 @@ int __init acpi_mps_check(void)
}
```
It checks the built-in `MPS` or [MultiProcessor Specification](http://en.wikipedia.org/wiki/MultiProcessor_Specification) table. If `CONFIG_X86_LOCAL_APIC` is set and `CONFIG_x86_MPPAARSE` is not set, `acpi_mps_check` prints warning message if the one of the command line options: `acpi=off`, `acpi=noirq` or `pci=noacpi` passed to the kernel. If `acpi_mps_check` returns `1` it means that we disable local [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) and clear `X86_FEATURE_APIC` bit in the of the current CPU with the `setup_clear_cpu_cap` macro. (more about CPU mask you can read in the [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)).
It checks the built-in `MPS` or [MultiProcessor Specification](http://en.wikipedia.org/wiki/MultiProcessor_Specification) table. If `CONFIG_X86_LOCAL_APIC` is set and `CONFIG_x86_MPPAARSE` is not set, `acpi_mps_check` prints warning message if the one of the command line options: `acpi=off`, `acpi=noirq` or `pci=noacpi` passed to the kernel. If `acpi_mps_check` returns `1` it means that we disable local [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) and clear `X86_FEATURE_APIC` bit in the of the current CPU with the `setup_clear_cpu_cap` macro. (more about CPU mask you can read in the [CPU masks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html)).
Early PCI dump
--------------------------------------------------------------------------------
@ -193,7 +193,7 @@ That's all. We will not go deep in the `pci` details, but will see more details
Finish with memory parsing
--------------------------------------------------------------------------------
After the `early_dump_pci_devices`, there are a couple of function related with available memory and [e820](http://en.wikipedia.org/wiki/E820) which we collected in the [First steps in the kernel setup](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) part:
After the `early_dump_pci_devices`, there are a couple of function related with available memory and [e820](http://en.wikipedia.org/wiki/E820) which we collected in the [First steps in the kernel setup](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) part:
```C
/* update the e820_saved too */
@ -500,7 +500,7 @@ for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
}
```
After this we set the limit for the `memblock` allocation with the `memblock_set_current_limit` function (read more about `memblock` you can in the [Linux kernel memory management Part 2](https://github.com/0xAX/linux-insides/blob/master/mm/linux-mm-2.md)), it will be `ISA_END_ADDRESS` or `0x100000` and fill the `memblock` information according to `e820` with the call of the `memblock_x86_fill` function. You can see the result of this function in the kernel initialization time:
After this we set the limit for the `memblock` allocation with the `memblock_set_current_limit` function (read more about `memblock` you can in the [Linux kernel memory management Part 2](https://github.com/0xAX/linux-insides/blob/master/MM/linux-mm-2.md)), it will be `ISA_END_ADDRESS` or `0x100000` and fill the `memblock` information according to `e820` with the call of the `memblock_x86_fill` function. You can see the result of this function in the kernel initialization time:
```
MEMBLOCK configuration:
@ -535,8 +535,8 @@ Links
* [NX bit](http://en.wikipedia.org/wiki/NX_bit)
* [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/kernel-parameters.txt)
* [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
* [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
* [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)
* [CPU masks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html)
* [Linux kernel memory management](https://0xax.gitbooks.io/linux-insides/content/MM/index.html)
* [PCI](http://en.wikipedia.org/wiki/Conventional_PCI)
* [e820](http://en.wikipedia.org/wiki/E820)
* [System Management BIOS](http://en.wikipedia.org/wiki/System_Management_BIOS)
@ -546,4 +546,4 @@ Links
* [MultiProcessor Specification](http://www.intel.com/design/pentium/datashts/24201606.pdf)
* [BSS](http://en.wikipedia.org/wiki/.bss)
* [SMBIOS specification](http://www.dmtf.org/sites/default/files/standards/documents/DSP0134v2.5Final.pdf)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html)
@ -4,7 +4,7 @@ Kernel initialization. Part 7.
The End of the architecture-specific initialization, almost...
================================================================================
This is the seventh part of the Linux Kernel initialization process which covers insides of the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c#L861). As you can know from the previous [parts](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html), the `setup_arch` function does some architecture-specific (in our case it is [x86_64](http://en.wikipedia.org/wiki/X86-64)) initialization stuff like reserving memory for kernel code/data/bss, early scanning of the [Desktop Management Interface](http://en.wikipedia.org/wiki/Desktop_Management_Interface), early dump of the [PCI](http://en.wikipedia.org/wiki/PCI) device and many many more. If you have read the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html), you can remember that we've finished it at the `setup_real_mode` function. In the next step, as we set limit of the [memblock](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html) to the all mapped pages, we can see the call of the `setup_log_buf` function from the [kernel/printk/printk.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/printk/printk.c).
This is the seventh part of the Linux Kernel initialization process which covers insides of the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c#L861). As you can know from the previous [parts](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html), the `setup_arch` function does some architecture-specific (in our case it is [x86_64](http://en.wikipedia.org/wiki/X86-64)) initialization stuff like reserving memory for kernel code/data/bss, early scanning of the [Desktop Management Interface](http://en.wikipedia.org/wiki/Desktop_Management_Interface), early dump of the [PCI](http://en.wikipedia.org/wiki/PCI) device and many many more. If you have read the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html), you can remember that we've finished it at the `setup_real_mode` function. In the next step, as we set limit of the [memblock](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-1.html) to the all mapped pages, we can see the call of the `setup_log_buf` function from the [kernel/printk/printk.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/printk/printk.c).
|
||||
|
||||
The `setup_log_buf` function sets up the kernel's cyclic log buffer, whose length depends on the `CONFIG_LOG_BUF_SHIFT` configuration option. According to the documentation of `CONFIG_LOG_BUF_SHIFT`, its value can be between `12` and `21`. Internally, the buffer is defined as an array of chars:
|
||||
|
||||
@ -32,7 +32,7 @@ setup_log_buf(1);
|
||||
|
||||
where `1` means that this is the early setup. In the next step we check the `new_log_buf_len` variable, which holds the updated length of the kernel log buffer, and either allocate new space for the buffer with the `memblock_virt_alloc` function or just return.
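To make the relation between the shift value and the buffer length concrete, here is a minimal sketch. It is not the kernel's code: `log_buf_len_for_shift` and `log_buf_resize_needed` are hypothetical helpers, the `12`..`21` range is taken from the text above, and the "allocate or just return" decision is modelled under the assumption that a new buffer is only wanted when the requested length is larger than the current one.

```C
#include <assert.h>
#include <stddef.h>

/* Range of the shift as given in the text; the names are hypothetical. */
#define SKETCH_LOG_BUF_SHIFT_MIN 12
#define SKETCH_LOG_BUF_SHIFT_MAX 21

/* Length in bytes of a log buffer for a given shift: 2^shift. */
static size_t log_buf_len_for_shift(unsigned int shift)
{
    assert(shift >= SKETCH_LOG_BUF_SHIFT_MIN &&
           shift <= SKETCH_LOG_BUF_SHIFT_MAX);
    return (size_t)1 << shift;
}

/*
 * Assumption for illustration: a new buffer is needed only when the
 * requested length exceeds the current one. Returns the length to
 * allocate, or 0 when the caller can "just return".
 */
static size_t log_buf_resize_needed(size_t current_len, size_t requested_len)
{
    return requested_len > current_len ? requested_len : 0;
}
```

With a shift of `12` the sketch yields a 4096-byte buffer, and with `21` a 2 MiB one.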
|
||||
|
||||
As kernel log buffer is ready, the next function is `reserve_initrd`. You can remember that we already called the `early_reserve_initrd` function in the fourth part of the [Kernel initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html). Now, as we reconstructed direct memory mapping in the `init_mem_mapping` function, we need to move [initrd](http://en.wikipedia.org/wiki/Initrd) into directly mapped memory. The `reserve_initrd` function starts from the definition of the base address and end address of the `initrd` and check that `initrd` is provided by a bootloader. All the same as what we saw in the `early_reserve_initrd`. But instead of the reserving place in the `memblock` area with the call of the `memblock_reserve` function, we get the mapped size of the direct memory area and check that the size of the `initrd` is not greater than this area with:
|
||||
As kernel log buffer is ready, the next function is `reserve_initrd`. You can remember that we already called the `early_reserve_initrd` function in the fourth part of the [Kernel initialization](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html). Now, as we reconstructed direct memory mapping in the `init_mem_mapping` function, we need to move [initrd](http://en.wikipedia.org/wiki/Initrd) into directly mapped memory. The `reserve_initrd` function starts from the definition of the base address and end address of the `initrd` and check that `initrd` is provided by a bootloader. All the same as what we saw in the `early_reserve_initrd`. But instead of the reserving place in the `memblock` area with the call of the `memblock_reserve` function, we get the mapped size of the direct memory area and check that the size of the `initrd` is not greater than this area with:
|
||||
|
||||
```C
|
||||
mapped_size = memblock_mem_size(max_pfn_mapped);
|
||||
@ -68,7 +68,7 @@ memblock_free(ramdisk_image, ramdisk_end - ramdisk_image);
|
||||
|
||||
After we have relocated the `initrd` ramdisk image, the next function is `vsmp_init` from [arch/x86/kernel/vsmp_64.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/vsmp_64.c). This function initializes support for `ScaleMP vSMP`. As I already wrote in the previous parts, this chapter will not cover initialization parts that are not directly related to `x86_64` (for example this one, or `ACPI`, etc.). So we will skip the implementation for now and come back to it in the part which covers techniques of parallel computing.
|
||||
|
||||
The next function is `io_delay_init` from the [arch/x86/kernel/io_delay.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/io_delay.c). This function allows to override default I/O delay `0x80` port. We already saw I/O delay in the [Last preparation before transition into protected mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html), now let's look on the `io_delay_init` implementation:
|
||||
The next function is `io_delay_init` from the [arch/x86/kernel/io_delay.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/io_delay.c). This function allows to override default I/O delay `0x80` port. We already saw I/O delay in the [Last preparation before transition into protected mode](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html), now let's look on the `io_delay_init` implementation:
|
||||
|
||||
```C
|
||||
void __init io_delay_init(void)
|
||||
@ -98,7 +98,7 @@ We can see `io_delay` command line parameter setup with the `early_param` macro
|
||||
early_param("io_delay", io_delay_param);
|
||||
```
|
||||
|
||||
More about `early_param` you can read in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html). So the `io_delay_param` function which setups `io_delay_override` variable will be called in the [do_early_param](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c#L413) function. `io_delay_param` function gets the argument of the `io_delay` kernel command line parameter and sets `io_delay_type` depends on it:
|
||||
More about `early_param` you can read in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html). So the `io_delay_param` function which setups `io_delay_override` variable will be called in the [do_early_param](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c#L413) function. `io_delay_param` function gets the argument of the `io_delay` kernel command line parameter and sets `io_delay_type` depends on it:
|
||||
|
||||
```C
|
||||
static int __init io_delay_param(char *s)
|
||||
@ -296,19 +296,19 @@ BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
|
||||
(unsigned long)VSYSCALL_ADDR);
|
||||
```
|
||||
|
||||
Now `vsyscall` area is in the `fix-mapped` area. That's all about `map_vsyscall`, if you do not know anything about fix-mapped addresses, you can read [Fix-Mapped Addresses and ioremap](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html). We will see more about `vsyscalls` in the `vsyscalls and vdso` part.
|
||||
Now `vsyscall` area is in the `fix-mapped` area. That's all about `map_vsyscall`, if you do not know anything about fix-mapped addresses, you can read [Fix-Mapped Addresses and ioremap](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-2.html). We will see more about `vsyscalls` in the `vsyscalls and vdso` part.
|
||||
|
||||
Getting the SMP configuration
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
You may remember how we made a search of the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) configuration in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html). Now we need to get the `SMP` configuration if we found it. For this we check `smp_found_config` variable which we set in the `smp_scan_config` function (read about it the previous part) and call the `get_smp_config` function:
|
||||
You may remember how we made a search of the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) configuration in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html). Now we need to get the `SMP` configuration if we found it. For this we check `smp_found_config` variable which we set in the `smp_scan_config` function (read about it the previous part) and call the `get_smp_config` function:
|
||||
|
||||
```C
|
||||
if (smp_found_config)
|
||||
get_smp_config();
|
||||
```
|
||||
|
||||
The `get_smp_config` expands to the `x86_init.mpparse.default_get_smp_config` function which is defined in the [arch/x86/kernel/mpparse.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/mpparse.c). This function defines a pointer to the multiprocessor floating pointer structure - `mpf_intel` (you can read about it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html)) and does some checks:
|
||||
The `get_smp_config` expands to the `x86_init.mpparse.default_get_smp_config` function which is defined in the [arch/x86/kernel/mpparse.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/mpparse.c). This function defines a pointer to the multiprocessor floating pointer structure - `mpf_intel` (you can read about it in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html)) and does some checks:
|
||||
|
||||
```C
|
||||
struct mpf_intel *mpf = mpf_found;
|
||||
@ -320,7 +320,7 @@ if (acpi_lapic && early)
|
||||
return;
|
||||
```
|
||||
|
||||
Here we can see that multiprocessor configuration was found in the `smp_scan_config` function or just return from the function if not. The next check is `acpi_lapic` and `early`. And as we did this checks, we start to read the `SMP` configuration. As we finished reading it, the next step is - `prefill_possible_map` function which makes preliminary filling of the possible CPU's `cpumask` (more about it you can read in the [Introduction to the cpumasks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)).
|
||||
Here we can see that multiprocessor configuration was found in the `smp_scan_config` function or just return from the function if not. The next check is `acpi_lapic` and `early`. And as we did this checks, we start to read the `SMP` configuration. As we finished reading it, the next step is - `prefill_possible_map` function which makes preliminary filling of the possible CPU's `cpumask` (more about it you can read in the [Introduction to the cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html)).
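The idea of pre-filling a mask of possible CPUs can be sketched as follows. This is only an illustration under stated assumptions: the real `cpumask` is a bitmap type with its own helpers, while here a single 64-bit word stands in for it, and `prefill_possible_sketch` and `cpu_possible_sketch` are hypothetical names.

```C
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask_sketch_t;

/* Mark CPUs 0..possible-1 as possible by setting the low `possible` bits. */
static cpumask_sketch_t prefill_possible_sketch(unsigned int possible)
{
    assert(possible >= 1 && possible <= 64);
    if (possible == 64)
        return UINT64_MAX;          /* avoid an undefined 64-bit shift */
    return ((cpumask_sketch_t)1 << possible) - 1;
}

/* Test whether a given CPU is marked as possible in the mask. */
static int cpu_possible_sketch(cpumask_sketch_t mask, unsigned int cpu)
{
    assert(cpu < 64);
    return (int)((mask >> cpu) & 1);
}
```

For four possible CPUs the mask is `0xf`, so CPU `3` is possible and CPU `4` is not.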
|
||||
|
||||
The rest of the setup_arch
|
||||
--------------------------------------------------------------------------------
|
||||
@ -334,7 +334,7 @@ That's all, and now we can back to the `start_kernel` from the `setup_arch`.
|
||||
Back to the main.c
|
||||
================================================================================
|
||||
|
||||
As I wrote above, we have finished with the `setup_arch` function and now we can back to the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c). As you may remember or saw yourself, `start_kernel` function as big as the `setup_arch`. So the couple of the next part will be dedicated to learning of this function. So, let's continue with it. After the `setup_arch` we can see the call of the `mm_init_cpumask` function. This function sets the [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) pointer to the memory descriptor `cpumask`. We can look on its implementation:
|
||||
As I wrote above, we have finished with the `setup_arch` function and now we can back to the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c). As you may remember or saw yourself, `start_kernel` function as big as the `setup_arch`. So the couple of the next part will be dedicated to learning of this function. So, let's continue with it. After the `setup_arch` we can see the call of the `mm_init_cpumask` function. This function sets the [cpumask](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) pointer to the memory descriptor `cpumask`. We can look on its implementation:
|
||||
|
||||
```C
|
||||
static inline void mm_init_cpumask(struct mm_struct *mm)
|
||||
@ -379,7 +379,7 @@ static void __init setup_command_line(char *command_line)
|
||||
|
||||
Here we can see that we allocate space for the three buffers which will contain the kernel command line for different purposes (read above). Once the space is allocated, we store `boot_command_line` in `saved_command_line` and `command_line` (the kernel command line from `setup_arch`) in `static_command_line`.
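The copying described above can be sketched in userspace C. This is an illustration, not the kernel's `setup_command_line`: it uses `malloc`/`calloc` instead of the early memblock allocator, all names ending in `_sketch` are hypothetical, and the third buffer is only allocated and zeroed because this passage does not say what is stored in it.

```C
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct cmdline_copies_sketch {
    char *saved;      /* untouched copy of the boot command line */
    char *third;      /* third buffer, same size, left empty here */
    char *static_cmd; /* copy of the command line from arch setup */
};

/* Allocate a buffer and copy a NUL-terminated string into it. */
static char *dup_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (copy)
        memcpy(copy, s, len);
    return copy;
}

/* Release all three buffers and reset the pointers. */
static void free_command_line_sketch(struct cmdline_copies_sketch *c)
{
    free(c->saved);
    free(c->third);
    free(c->static_cmd);
    c->saved = c->third = c->static_cmd = NULL;
}

/* Allocate the three buffers and fill two of them; 0 on success, -1 on failure. */
static int setup_command_line_sketch(struct cmdline_copies_sketch *out,
                                     const char *boot_cmd,
                                     const char *arch_cmd)
{
    assert(out && boot_cmd && arch_cmd);
    out->saved = dup_string(boot_cmd);
    out->third = calloc(strlen(boot_cmd) + 1, 1);
    out->static_cmd = dup_string(arch_cmd);
    if (!out->saved || !out->third || !out->static_cmd) {
        free_command_line_sketch(out);
        return -1;
    }
    return 0;
}
```

Keeping separate copies means later parsing can modify one buffer in place while the saved copy stays intact.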
|
||||
|
||||
The next function after the `setup_command_line` is the `setup_nr_cpu_ids`. This function setting `nr_cpu_ids` (number of CPUs) according to the last bit in the `cpu_possible_mask` (more about it you can read in the chapter describes [cpumasks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) concept). Let's look on its implementation:
|
||||
The next function after the `setup_command_line` is the `setup_nr_cpu_ids`. This function setting `nr_cpu_ids` (number of CPUs) according to the last bit in the `cpu_possible_mask` (more about it you can read in the chapter describes [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) concept). Let's look on its implementation:
|
||||
|
||||
```C
|
||||
void __init setup_nr_cpu_ids(void)
|
||||
@ -479,4 +479,4 @@ Links
|
||||
* [vsyscalls](https://lwn.net/Articles/446528/)
|
||||
* [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing)
|
||||
* [jiffy](http://en.wikipedia.org/wiki/Jiffy_%28time%29)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/%20linux-initialization-6.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html)
|
||||
|
@ -4,9 +4,9 @@ Kernel initialization. Part 8.
|
||||
Scheduler initialization
|
||||
================================================================================
|
||||
|
||||
This is the eighth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) of the Linux kernel initialization process chapter and we stopped on the `setup_nr_cpu_ids` function in the [previous part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-7.md).
|
||||
This is the eighth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) of the Linux kernel initialization process chapter and we stopped on the `setup_nr_cpu_ids` function in the [previous part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-7.md).
|
||||
|
||||
The main point of this part is [scheduler](http://en.wikipedia.org/wiki/Scheduling_%28computing%29) initialization. But before we will start to learn initialization process of the scheduler, we need to do some stuff. The next step in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) is the `setup_per_cpu_areas` function. This function setups memory areas for the `percpu` variables, more about it you can read in the special part about the [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html). After `percpu` areas is up and running, the next step is the `smp_prepare_boot_cpu` function.
|
||||
The main point of this part is [scheduler](http://en.wikipedia.org/wiki/Scheduling_%28computing%29) initialization. But before we will start to learn initialization process of the scheduler, we need to do some stuff. The next step in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) is the `setup_per_cpu_areas` function. This function setups memory areas for the `percpu` variables, more about it you can read in the special part about the [Per-CPU variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html). After `percpu` areas is up and running, the next step is the `smp_prepare_boot_cpu` function.
|
||||
|
||||
This function does some preparations for [symmetric multiprocessing](http://en.wikipedia.org/wiki/Symmetric_multiprocessing). Since this function is architecture specific, it is located in the [arch/x86/include/asm/smp.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/smp.h#L78) Linux kernel header file. Let's look at the definition of this function:
|
||||
|
||||
@ -44,7 +44,7 @@ void __init native_smp_prepare_boot_cpu(void)
|
||||
}
|
||||
```
|
||||
|
||||
and executes following things: first of all it gets the `id` of the current CPU (which is Bootstrap processor and its `id` is zero for this moment) with the `smp_processor_id` function. I will not explain how the `smp_processor_id` works, because we already saw it in the [Kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) part. After we've got processor `id` number we reload [Global Descriptor Table](http://en.wikipedia.org/wiki/Global_Descriptor_Table) for the given CPU with the `switch_to_new_gdt` function:
|
||||
and executes following things: first of all it gets the `id` of the current CPU (which is Bootstrap processor and its `id` is zero for this moment) with the `smp_processor_id` function. I will not explain how the `smp_processor_id` works, because we already saw it in the [Kernel entry point](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) part. After we've got processor `id` number we reload [Global Descriptor Table](http://en.wikipedia.org/wiki/Global_Descriptor_Table) for the given CPU with the `switch_to_new_gdt` function:
|
||||
|
||||
```C
|
||||
void switch_to_new_gdt(int cpu)
|
||||
@ -58,7 +58,7 @@ void switch_to_new_gdt(int cpu)
|
||||
}
|
||||
```
|
||||
|
||||
The `gdt_descr` variable represents pointer to the `GDT` descriptor here (we already saw definition of a `desc_ptr` structure in the [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) part). We get the address and the size of the `GDT` descriptor for the `CPU` with the given `id`. The `GDT_SIZE` is `256` or:
|
||||
The `gdt_descr` variable represents pointer to the `GDT` descriptor here (we already saw definition of a `desc_ptr` structure in the [Early interrupt and exception handling](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) part). We get the address and the size of the `GDT` descriptor for the `CPU` with the given `id`. The `GDT_SIZE` is `256` or:
|
||||
|
||||
```C
|
||||
#define GDT_SIZE (GDT_ENTRIES * 8)
|
||||
@ -75,7 +75,7 @@ static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu)
|
||||
|
||||
The `get_cpu_gdt_table` function uses the `per_cpu` macro to get the value of the `gdt_page` percpu variable for the given CPU number (the bootstrap processor with `id` 0 in our case).
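Conceptually, a per-CPU lookup like this returns "the copy of the variable that belongs to CPU `n`". The sketch below models that with a plain array indexed by CPU number. This is a simplification and an assumption made for illustration: the kernel's `per_cpu` macro works with per-CPU offsets rather than an array, and every name ending in `_sketch` or starting with `SKETCH_` is hypothetical.

```C
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_CPUS     4   /* hypothetical CPU count */
#define SKETCH_GDT_ENTRIES 16  /* hypothetical number of descriptors */

/* Simplified 8-byte descriptor. */
struct desc_sketch {
    uint64_t raw;
};

/* One table of descriptors per CPU. */
struct gdt_page_sketch {
    struct desc_sketch gdt[SKETCH_GDT_ENTRIES];
};

/* Array standing in for the per-CPU `gdt_page` variable. */
static struct gdt_page_sketch gdt_pages_sketch[SKETCH_NR_CPUS];

/* Return the descriptor table that belongs to the given CPU. */
static struct desc_sketch *get_cpu_gdt_table_sketch(unsigned int cpu)
{
    assert(cpu < SKETCH_NR_CPUS);
    return gdt_pages_sketch[cpu].gdt;
}
```

Each CPU gets its own table, so a write through the pointer returned for one CPU does not affect the table of another.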
|
||||
|
||||
You may ask the following question: so, if we can access `gdt_page` percpu variable, where it was defined? Actually we already saw it in this book. If you have read the first [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of this chapter, you can remember that we saw definition of the `gdt_page` in the [arch/x86/kernel/head_64.S](https://github.com/0xAX/linux/blob/0a07b238e5f488b459b6113a62e06b6aab017f71/arch/x86/kernel/head_64.S):
|
||||
You may ask the following question: so, if we can access `gdt_page` percpu variable, where it was defined? Actually we already saw it in this book. If you have read the first [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of this chapter, you can remember that we saw definition of the `gdt_page` in the [arch/x86/kernel/head_64.S](https://github.com/0xAX/linux/blob/0a07b238e5f488b459b6113a62e06b6aab017f71/arch/x86/kernel/head_64.S):
|
||||
|
||||
```assembly
|
||||
early_gdt_descr:
|
||||
@ -107,7 +107,7 @@ DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
|
||||
...
|
||||
```
|
||||
|
||||
more about `percpu` variables you can read in the [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) part. As we got address and size of the `GDT` descriptor we reload `GDT` with the `load_gdt` which just execute `lgdt` instruct and load `percpu_segment` with the following function:
|
||||
more about `percpu` variables you can read in the [Per-CPU variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) part. As we got address and size of the `GDT` descriptor we reload `GDT` with the `load_gdt` which just execute `lgdt` instruct and load `percpu_segment` with the following function:
|
||||
|
||||
```C
|
||||
void load_percpu_segment(int cpu) {
|
||||
@ -205,11 +205,11 @@ After this function we can see the kernel command line in the initialization out
|
||||
|
||||
![kernel command line](http://oi58.tinypic.com/2m7vz10.jpg)
|
||||
|
||||
And a couple of functions such as `parse_early_param` and `parse_args` which handles linux kernel command line. You may remember that we already saw the call of the `parse_early_param` function in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the kernel initialization chapter, so why we call it again? Answer is simple: we call this function in the architecture-specific code (`x86_64` in our case), but not all architecture calls this function. And we need to call the second function `parse_args` to parse and handle non-early command line arguments.
|
||||
And a couple of functions such as `parse_early_param` and `parse_args` which handles linux kernel command line. You may remember that we already saw the call of the `parse_early_param` function in the sixth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the kernel initialization chapter, so why we call it again? Answer is simple: we call this function in the architecture-specific code (`x86_64` in our case), but not all architecture calls this function. And we need to call the second function `parse_args` to parse and handle non-early command line arguments.
|
||||
|
||||
In the next step we can see the call of the `jump_label_init` function from [kernel/jump_label.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/jump_label.c), which initializes [jump labels](https://lwn.net/Articles/412072/).
|
||||
|
||||
After this we can see the call of the `setup_log_buf` function which setups the [printk](http://www.makelinux.net/books/lkd2/ch18lev1sec3) log buffer. We already saw this function in the seventh [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html) of the linux kernel initialization process chapter.
|
||||
After this we can see the call of the `setup_log_buf` function which setups the [printk](http://www.makelinux.net/books/lkd2/ch18lev1sec3) log buffer. We already saw this function in the seventh [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html) of the linux kernel initialization process chapter.
|
||||
|
||||
PID hash initialization
|
||||
--------------------------------------------------------------------------------
|
||||
@ -230,7 +230,7 @@ pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18,
|
||||
```
|
||||
|
||||
The number of elements of the `pid_hash` depends on the `RAM` configuration, but it can be between `2^4` and `2^12`. The `pidhash_init` computes the size
|
||||
and allocates the required storage (which is `hlist` in our case - the same as [doubly linked list](http://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html), but contains one pointer instead on the [struct hlist_head](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/types.h)]. The `alloc_large_system_hash` function allocates a large system hash table with `memblock_virt_alloc_nopanic` if we pass `HASH_EARLY` flag (as it in our case) or with `__vmalloc` if we did no pass this flag.
|
||||
and allocates the required storage (which is `hlist` in our case - the same as [doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-1.html), but contains one pointer instead on the [struct hlist_head](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/types.h)]. The `alloc_large_system_hash` function allocates a large system hash table with `memblock_virt_alloc_nopanic` if we pass `HASH_EARLY` flag (as it in our case) or with `__vmalloc` if we did no pass this flag.
|
||||
|
||||
The result we can see in the `dmesg` output:
|
||||
|
||||
@ -255,7 +255,7 @@ pgtable_init();
|
||||
vmalloc_init();
|
||||
```
|
||||
|
||||
The first is `page_ext_init_flatmem` which depends on the `CONFIG_SPARSEMEM` kernel configuration option and initializes extended data per page handling. The `mem_init` releases all `bootmem`, the `kmem_cache_init` initializes kernel cache, the `percpu_init_late` - replaces `percpu` chunks with those allocated by [slub](http://en.wikipedia.org/wiki/SLUB_%28software%29), the `pgtable_init` - initializes the `page->ptl` kernel cache, the `vmalloc_init` - initializes `vmalloc`. Please, **NOTE** that we will not dive into details about all of these functions and concepts, but we will see all of they it in the [Linux kernel memory manager](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) chapter.
|
||||
The first is `page_ext_init_flatmem` which depends on the `CONFIG_SPARSEMEM` kernel configuration option and initializes extended data per page handling. The `mem_init` releases all `bootmem`, the `kmem_cache_init` initializes kernel cache, the `percpu_init_late` - replaces `percpu` chunks with those allocated by [slub](http://en.wikipedia.org/wiki/SLUB_%28software%29), the `pgtable_init` - initializes the `page->ptl` kernel cache, the `vmalloc_init` - initializes `vmalloc`. Please, **NOTE** that we will not dive into details about all of these functions and concepts, but we will see all of they it in the [Linux kernel memory manager](https://0xax.gitbooks.io/linux-insides/content/MM/index.html) chapter.
|
||||
|
||||
That's all. Now we can look at the `scheduler`.
|
||||
|
||||
@ -533,7 +533,7 @@ __init void init_sched_fair_class(void)
|
||||
}
|
||||
```
|
||||
|
||||
Here we register a [soft irq](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html) that will call the `run_rebalance_domains` handler. After the `SCHED_SOFTIRQ` will be triggered, the `run_rebalance` will be called to rebalance a run queue on the current CPU.
|
||||
Here we register a [soft irq](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-9.html) that will call the `run_rebalance_domains` handler. After the `SCHED_SOFTIRQ` will be triggered, the `run_rebalance` will be called to rebalance a run queue on the current CPU.
|
||||
|
||||
The last two steps of the `sched_init` function are the initialization of scheduler statistics and setting the `scheduler_running` variable:
|
||||
|
||||
@ -555,19 +555,19 @@ If you have any questions or suggestions write me a comment or ping me at [twitt
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
|
||||
* [CPU masks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html)
|
||||
* [high-resolution kernel timer](https://www.kernel.org/doc/Documentation/timers/hrtimers.txt)
|
||||
* [spinlock](http://en.wikipedia.org/wiki/Spinlock)
|
||||
* [Run queue](http://en.wikipedia.org/wiki/Run_queue)
|
||||
* [Linux kernel memory manager](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)
|
||||
* [Linux kernel memory manager](https://0xax.gitbooks.io/linux-insides/content/MM/index.html)
|
||||
* [slub](http://en.wikipedia.org/wiki/SLUB_%28software%29)
|
||||
* [virtual file system](http://en.wikipedia.org/wiki/Virtual_file_system)
|
||||
* [Linux kernel hotplug documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
|
||||
* [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
|
||||
* [Global Descriptor Table](http://en.wikipedia.org/wiki/Global_Descriptor_Table)
|
||||
* [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
|
||||
* [Per-CPU variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html)
|
||||
* [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing)
|
||||
* [RCU](http://en.wikipedia.org/wiki/Read-copy-update)
|
||||
* [CFS Scheduler documentation](https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt)
|
||||
* [Real-Time group scheduling](https://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html)
|
||||
|
@ -4,7 +4,7 @@ Kernel initialization. Part 9.
|
||||
RCU initialization
|
||||
================================================================================
|
||||
|
||||
This is ninth part of the [Linux Kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and in the previous part we stopped at the [scheduler initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html). In this part we will continue to dive to the linux kernel initialization process and the main purpose of this part will be to learn about initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). We can see that the next step in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) after the `sched_init` is the call of the `preempt_disable`. There are two macros:
|
||||
This is ninth part of the [Linux Kernel initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and in the previous part we stopped at the [scheduler initialization](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html). In this part we will continue to dive to the linux kernel initialization process and the main purpose of this part will be to learn about initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). We can see that the next step in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) after the `sched_init` is the call of the `preempt_disable`. There are two macros:
|
||||
|
||||
* `preempt_disable`
|
||||
* `preempt_enable`
|
||||
@ -38,7 +38,7 @@ In the first implementation of the `preempt_disable` we increment this `__preemp
|
||||
#define preempt_count_add(val) __preempt_count_add(val)
|
||||
```
|
||||
|
||||
where `preempt_count_add` calls the `raw_cpu_add_4` macro which adds `1` to the given `percpu` variable (`__preempt_count`) in our case (more about `precpu` variables you can read in the part about [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)). Ok, we increased `__preempt_count` and the next step we can see the call of the `barrier` macro in the both macros. The `barrier` macro inserts an optimization barrier. In the processors with `x86_64` architecture independent memory access operations can be performed in any order. That's why we need the opportunity to point compiler and processor on compliance of order. This mechanism is memory barrier. Let's consider a simple example:
|
||||
where `preempt_count_add` calls the `raw_cpu_add_4` macro which adds `1` to the given `percpu` variable (`__preempt_count`) in our case (more about `precpu` variables you can read in the part about [Per-CPU variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html)). Ok, we increased `__preempt_count` and the next step we can see the call of the `barrier` macro in the both macros. The `barrier` macro inserts an optimization barrier. In the processors with `x86_64` architecture independent memory access operations can be performed in any order. That's why we need the opportunity to point compiler and processor on compliance of order. This mechanism is memory barrier. Let's consider a simple example:
|
||||
|
||||
```C
|
||||
preempt_disable();
|
||||
@ -83,7 +83,7 @@ void __init idr_init_cache(void)
|
||||
}
|
||||
```
|
||||
|
||||
Here we can see the call of the `kmem_cache_create`. We already called the `kmem_cache_init` in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c#L485). This function create generalized caches again using the `kmem_cache_alloc` (more about caches we will see in the [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) chapter). In our case, as we are using `kmem_cache_t` which will be used by the [slab](http://en.wikipedia.org/wiki/Slab_allocation) allocator and `kmem_cache_create` creates it. As you can see we pass five parameters to the `kmem_cache_create`:
|
||||
Here we can see the call of `kmem_cache_create`. We already called `kmem_cache_init` in [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c#L485). This function creates generalized caches, again using `kmem_cache_alloc` (we will see more about caches in the [Linux kernel memory management](https://0xax.gitbooks.io/linux-insides/content/MM/index.html) chapter). In our case we need a `kmem_cache_t` that will be used by the [slab](http://en.wikipedia.org/wiki/Slab_allocation) allocator, and `kmem_cache_create` creates it. As you can see, we pass five parameters to `kmem_cache_create`:
|
||||
|
||||
* name of the cache;
|
||||
* size of the object to store in cache;
|
||||
@ -127,7 +127,7 @@ The next step is [RCU](http://en.wikipedia.org/wiki/Read-copy-update) initializa
|
||||
|
||||
In the first case `rcu_init` will be in the [kernel/rcu/tiny.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/rcu/tiny.c) and in the second case it will be defined in the [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/rcu/tree.c). We will see the implementation of the `tree rcu`, but first of all about the `RCU` in general.
|
||||
|
||||
`RCU` or read-copy update is a scalable high-performance synchronization mechanism implemented in the Linux kernel. On the early stage the linux kernel provided support and environment for the concurrently running applications, but all execution was serialized in the kernel using a single global lock. In our days linux kernel has no single global lock, but provides different mechanisms including [lock-free data structures](http://en.wikipedia.org/wiki/Concurrent_data_structure), [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) data structures and other. One of these mechanisms is - the `read-copy update`. The `RCU` technique is designed for rarely-modified data structures. The idea of the `RCU` is simple. For example we have a rarely-modified data structure. If somebody wants to change this data structure, we make a copy of this data structure and make all changes in the copy. In the same time all other users of the data structure use old version of it. Next, we need to choose safe moment when original version of the data structure will have no users and update it with the modified copy.
|
||||
`RCU`, or read-copy update, is a scalable high-performance synchronization mechanism implemented in the Linux kernel. In its early days the Linux kernel provided support and an environment for concurrently running applications, but all execution in the kernel was serialized using a single global lock. Today the Linux kernel has no single global lock, but provides different mechanisms including [lock-free data structures](http://en.wikipedia.org/wiki/Concurrent_data_structure), [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) data structures and others. One of these mechanisms is `read-copy update`. The `RCU` technique is designed for rarely modified data structures. The idea of `RCU` is simple. Suppose we have a rarely modified data structure. If somebody wants to change it, we make a copy of the data structure and make all changes in the copy. At the same time, all other users of the data structure keep using the old version. Next, we need to choose a safe moment when the original version of the data structure has no users and replace it with the modified copy.
|
||||
|
||||
Of course this description of `RCU` is very simplified. To understand some details about `RCU`, first of all we need to learn some terminology. Data readers in `RCU` execute in a [critical section](http://en.wikipedia.org/wiki/Critical_section). Every time a data reader enters the critical section it calls `rcu_read_lock`, and it calls `rcu_read_unlock` on exit from the critical section. If a thread is not in the critical section, it is in a state called the `quiescent state`. A period during which every thread passes through a `quiescent state` is called a `grace period`. If a thread wants to remove an element from the data structure, this occurs in two steps. The first step is `removal`: the writer atomically removes the element from the data structure but does not release the physical memory. From this moment new readers can no longer reach the removed element, but readers that were already inside their critical sections may still be using it. The writer thread then announces the removal and waits until the `grace period` is finished. After the `grace period` has finished, the second step of the element removal starts: the memory of the element is released.
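
The copy, modify and publish steps described above can be sketched in user space with C11 atomics. This is only an illustration and not the kernel's implementation: `struct config`, `current_config`, `publish_initial`, `read_timeout` and `update_timeout` are hypothetical names, a single writer is assumed, and the `grace period` is not modelled at all, so the caller has to decide by itself when the old version is safe to free.

```C
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical rarely modified structure, used only for this sketch. */
struct config {
    int timeout;
    int retries;
};

/* Pointer to the currently published version of the structure. */
static _Atomic(struct config *) current_config;

/* Publish the very first version (no readers exist yet). */
static void publish_initial(struct config *initial)
{
    atomic_store_explicit(&current_config, initial, memory_order_release);
}

/* Reader: load the published pointer once and read through it. */
static int read_timeout(void)
{
    struct config *c = atomic_load_explicit(&current_config,
                                            memory_order_acquire);
    return c->timeout;
}

/*
 * Writer (a single writer is assumed): copy the current version, change
 * the copy and publish the copy with one atomic pointer store.
 * Returns the old version. It may be freed only after every reader that
 * could still hold it has finished, which is what a grace period
 * guarantees in real RCU and what this sketch leaves to the caller.
 * Returns NULL if the allocation failed; nothing is changed in that case.
 */
static struct config *update_timeout(int new_timeout)
{
    struct config *old = atomic_load_explicit(&current_config,
                                              memory_order_relaxed);
    struct config *copy;

    assert(old != NULL);          /* publish_initial() must come first */
    copy = malloc(sizeof(*copy));
    if (copy == NULL)
        return NULL;

    *copy = *old;                 /* copy */
    copy->timeout = new_timeout;  /* modify only the private copy */
    atomic_store_explicit(&current_config, copy, memory_order_release);
    return old;                   /* reclamation is deferred */
}
```

A reader that loaded the pointer before the store keeps seeing a consistent old version, and a reader that loads it afterwards sees the new one; neither ever observes a half-updated structure.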
|
||||
|
||||
@ -378,7 +378,7 @@ Ok, we already passed the main theme of this part which is `RCU` initialization,
|
||||
|
||||
After we initialized `RCU`, the next step which you can see in [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) is the `trace_init` function. As you can understand from its name, this function initializes the [tracing](http://en.wikipedia.org/wiki/Tracing_%28software%29) subsystem. You can read more about the Linux kernel trace system [here](http://elinux.org/Kernel_Trace_Systems).
|
||||
|
||||
After the `trace_init`, we can see the call of the `radix_tree_init`. If you are familiar with the different data structures, you can understand from the name of this function that it initializes kernel implementation of the [Radix tree](http://en.wikipedia.org/wiki/Radix_tree). This function is defined in the [lib/radix-tree.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/lib/radix-tree.c) and you can read more about it in the part about [Radix tree](https://0xax.gitbooks.io/linux-insides/content/DataStructures/radix-tree.html).
|
||||
After the `trace_init`, we can see the call of the `radix_tree_init`. If you are familiar with the different data structures, you can understand from the name of this function that it initializes kernel implementation of the [Radix tree](http://en.wikipedia.org/wiki/Radix_tree). This function is defined in the [lib/radix-tree.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/lib/radix-tree.c) and you can read more about it in the part about [Radix tree](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-2.html).
|
||||
|
||||
In the next step we can see the functions which are related to the `interrupts handling` subsystem, they are:
|
||||
|
||||
@ -394,18 +394,18 @@ The next couple of functions are related with the [perf](https://perf.wiki.kerne
|
||||
local_irq_enable();
|
||||
```
|
||||
|
||||
which expands to the `sti` instruction and making post initialization of the [SLAB](http://en.wikipedia.org/wiki/Slab_allocation) with the call of the `kmem_cache_init_late` function (As I wrote above we will know about the `SLAB` in the [Linux memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) chapter).
|
||||
which expands to the `sti` instruction, and then performing post-initialization of the [SLAB](http://en.wikipedia.org/wiki/Slab_allocation) with the call of the `kmem_cache_init_late` function (as I wrote above, we will learn about `SLAB` in the [Linux memory management](https://0xax.gitbooks.io/linux-insides/content/MM/index.html) chapter).
|
||||
|
||||
After the post initialization of the `SLAB`, next point is initialization of the console with the `console_init` function from the [drivers/tty/tty_io.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/drivers/tty/tty_io.c).
|
||||
|
||||
After the console initialization, we can see the `lockdep_info` function which prints information about the [Lock dependency validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt). After this, we can see the initialization of the dynamic allocation of the `debug objects` with the `debug_objects_mem_init`, kernel memory leak [detector](https://www.kernel.org/doc/Documentation/kmemleak.txt) initialization with the `kmemleak_init`, `percpu` pageset setup with the `setup_per_cpu_pageset`, setup of the [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) policy with the `numa_policy_init`, setting time for the scheduler with the `sched_clock_init`, `pidmap` initialization with the call of the `pidmap_init` function for the initial `PID` namespace, cache creation with the `anon_vma_init` for the private virtual memory areas and early initialization of the [ACPI](http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) with the `acpi_early_init`.
|
||||
|
||||
This is the end of the ninth part of the [linux kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and here we saw initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). In the last paragraph of this part (`Rest of the initialization process`) we will go through many functions but did not dive into details about their implementations. Do not worry if you do not know anything about these stuff or you know and do not understand anything about this. As I already wrote many times, we will see details of implementations in other parts or other chapters.
|
||||
This is the end of the ninth part of the [linux kernel initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and here we saw the initialization of [RCU](http://en.wikipedia.org/wiki/Read-copy-update). In the last paragraph of this part (`Rest of the initialization process`) we went through many functions but did not dive into the details of their implementations. Do not worry if you do not know anything about these things, or if you have heard of them but do not understand them yet. As I already wrote many times, we will see details of the implementations in other parts or other chapters.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
It is the end of the ninth part about the linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). In this part, we looked on the initialization process of the `RCU` subsystem. In the next part we will continue to dive into linux kernel initialization process and I hope that we will finish with the `start_kernel` function and will go to the `rest_init` function from the same [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) source code file and will see the start of the first process.
|
||||
It is the end of the ninth part about the linux kernel [initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). In this part, we looked at the initialization process of the `RCU` subsystem. In the next part we will continue to dive into the linux kernel initialization process, and I hope that we will finish with the `start_kernel` function, move on to the `rest_init` function from the same [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) source code file, and see the start of the first process.
|
||||
|
||||
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
||||
|
||||
@ -423,8 +423,8 @@ Links
|
||||
* [integer ID management](https://lwn.net/Articles/103209/)
|
||||
* [Documentation/memory-barriers.txt](https://www.kernel.org/doc/Documentation/memory-barriers.txt)
|
||||
* [Runtime locking correctness validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
|
||||
* [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
|
||||
* [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)
|
||||
* [Per-CPU variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html)
|
||||
* [Linux kernel memory management](https://0xax.gitbooks.io/linux-insides/content/MM/index.html)
|
||||
* [slab](http://en.wikipedia.org/wiki/Slab_allocation)
|
||||
* [i2c](http://en.wikipedia.org/wiki/I%C2%B2C)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html)
|
||||
|
14
Interrupts/README.md
Normal file
@ -0,0 +1,14 @@
|
||||
# Interrupts and Interrupt Handling
|
||||
|
||||
In the following posts, we will cover interrupts and exceptions handling in the linux kernel.
|
||||
|
||||
* [Interrupts and Interrupt Handling. Part 1.](linux-interrupts-1.md) - describes interrupts and interrupt handling theory.
|
||||
* [Interrupts in the Linux Kernel](linux-interrupts-2.md) - describes topics related to interrupt and exception handling from the early stage.
|
||||
* [Early interrupt handlers](linux-interrupts-3.md) - describes early interrupt handlers.
|
||||
* [Interrupt handlers](linux-interrupts-4.md) - describes first non-early interrupt handlers.
|
||||
* [Implementation of exception handlers](linux-interrupts-5.md) - describes implementation of some exception handlers such as double fault, divide by zero etc.
|
||||
* [Handling non-maskable interrupts](linux-interrupts-6.md) - describes handling of non-maskable interrupts and remaining interrupt handlers from the architecture-specific part.
|
||||
* [External hardware interrupts](linux-interrupts-7.md) - describes early initialization of code which is related to handling external hardware interrupts.
|
||||
* [Non-early initialization of the IRQs](linux-interrupts-8.md) - describes non-early initialization of code which is related to handling external hardware interrupts.
|
||||
* [Softirq, Tasklets and Workqueues](linux-interrupts-9.md) - describes softirqs, tasklets and workqueues concepts.
|
||||
* [Last part](linux-interrupts-10.md) - this is the last part of the `Interrupts and Interrupt Handling` chapter and here we will see a real hardware driver and some interrupts related stuff.
|
@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 1.
|
||||
Introduction
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the first part of the new chapter of the [linux insides](http://0xax.gitbooks.io/linux-insides/content/) book. We have come a long way in the previous [chapter](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) of this book. We started from the earliest [steps](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of kernel initialization and finished with the [launch](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-10.html) of the first `init` process. Yes, we saw several initialization steps which are related to the various kernel subsystems. But we did not dig deep into the details of these subsystems. With this chapter, we will try to understand how the various kernel subsystems work and how they are implemented. As you can already understand from the chapter's title, the first subsystem will be [interrupts](http://en.wikipedia.org/wiki/Interrupt).
|
||||
This is the first part of the new chapter of the [linux insides](https://0xax.gitbooks.io/linux-insides/content/) book. We have come a long way in the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) of this book. We started from the earliest [steps](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of kernel initialization and finished with the [launch](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-10.html) of the first `init` process. Yes, we saw several initialization steps which are related to the various kernel subsystems. But we did not dig deep into the details of these subsystems. With this chapter, we will try to understand how the various kernel subsystems work and how they are implemented. As you can already understand from the chapter's title, the first subsystem will be [interrupts](http://en.wikipedia.org/wiki/Interrupt).
|
||||
|
||||
What is an Interrupt?
|
||||
--------------------------------------------------------------------------------
|
||||
@ -37,7 +37,7 @@ Addresses of each of the interrupt handlers are maintained in a special location
|
||||
BUG_ON((unsigned)n > 0xFF);
|
||||
```
|
||||
|
||||
You can find this check within the Linux kernel source code related to interrupt setup (eg. The `set_intr_gate`, `void set_system_intr_gate` in [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/desc.h)). The first 32 vector numbers from `0` to `31` are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. You can find the table with the description of these vector numbers in the second part of the Linux kernel initialization process - [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html). Vector numbers from `32` to `255` are designated as user-defined interrupts and are not reserved by the processor. These interrupts are generally assigned to external I/O devices to enable those devices to send interrupts to the processor.
|
||||
You can find this check within the Linux kernel source code related to interrupt setup (e.g. `set_intr_gate` and `set_system_intr_gate` in [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/desc.h)). The first 32 vector numbers from `0` to `31` are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. You can find the table with the description of these vector numbers in the second part of the Linux kernel initialization process - [Early interrupt and exception handling](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html). Vector numbers from `32` to `255` are designated as user-defined interrupts and are not reserved by the processor. These interrupts are generally assigned to external I/O devices to enable those devices to send interrupts to the processor.
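
The vector ranges above can be expressed as two small predicates. This is a minimal sketch and not kernel code: `vector_is_valid`, `vector_is_reserved` and `FIRST_USER_VECTOR_SKETCH` are hypothetical names introduced only for illustration. It also shows why the check casts `n` to `unsigned`: a negative number turns into a very large unsigned value and is rejected by the same comparison.

```C
#include <assert.h>
#include <stdbool.h>

/* Hypothetical name: vectors 0..31 are reserved by the processor. */
#define FIRST_USER_VECTOR_SKETCH 32

/*
 * Same condition as BUG_ON((unsigned)n > 0xFF), inverted:
 * true if n is a usable vector number in the range 0..255.
 * The cast makes negative values fail the comparison as well.
 */
static bool vector_is_valid(int n)
{
    return (unsigned)n <= 0xFF;
}

/* True for vectors used for architecture-defined exceptions (0..31). */
static bool vector_is_reserved(int n)
{
    assert(vector_is_valid(n));
    return n < FIRST_USER_VECTOR_SKETCH;
}
```

With these helpers, vector `31` is reported as reserved, vector `32` is the first one available for external devices, and both `256` and `-1` are rejected as invalid.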
|
||||
|
||||
Now let's talk about the types of interrupts. Broadly speaking, we can split interrupts into 2 major classes:
|
||||
|
||||
@ -58,7 +58,7 @@ Next a `trap` is an exception which is reported immediately following the execut
|
||||
|
||||
Finally an `abort` is an exception that does not always report the exact instruction which caused the exception and does not allow the interrupted program to be resumed.
|
||||
|
||||
Also we already know from the previous [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) that interrupts can be classified as `maskable` and `non-maskable`. Maskable interrupts are interrupts which can be blocked with the two following instructions for `x86_64` - `sti` and `cli`. We can find them in the Linux kernel source code:
|
||||
Also we already know from the previous [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) that interrupts can be classified as `maskable` and `non-maskable`. Maskable interrupts are interrupts which can be blocked with the two following instructions for `x86_64` - `sti` and `cli`. We can find them in the Linux kernel source code:
|
||||
|
||||
```C
|
||||
static inline void native_irq_disable(void)
|
||||
@ -135,13 +135,13 @@ If multiple exceptions or interrupts occur at the same time, the processor handl
|
||||
+--------------+-------------------------------------------------+
|
||||
```
|
||||
|
||||
Now that we know a little about the various types of interrupts and exceptions, it is time to move on to a more practical part. We start with the description of the `Interrupt Descriptor Table`. As mentioned earlier, the `IDT` stores entry points of the interrupts and exceptions handlers. The `IDT` is similar in structure to the `Global Descriptor Table` which we saw in the second part of the [Kernel booting process](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html). But of course it has some differences. Instead of `descriptors`, the `IDT` entries are called `gates`. It can contain one of the following gates:
|
||||
Now that we know a little about the various types of interrupts and exceptions, it is time to move on to a more practical part. We start with the description of the `Interrupt Descriptor Table`. As mentioned earlier, the `IDT` stores entry points of the interrupts and exceptions handlers. The `IDT` is similar in structure to the `Global Descriptor Table` which we saw in the second part of the [Kernel booting process](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html). But of course it has some differences. Instead of `descriptors`, the `IDT` entries are called `gates`. It can contain one of the following gates:
|
||||
|
||||
* Interrupt gates
|
||||
* Task gates
|
||||
* Trap gates.
|
||||
|
||||
in the `x86` architecture. Only [long mode](http://en.wikipedia.org/wiki/Long_mode) interrupt gates and trap gates can be referenced in the `x86_64`. Like the `Global Descriptor Table`, the `Interrupt Descriptor table` is an array of 8-byte gates on `x86` and an array of 16-byte gates on `x86_64`. We can remember from the second part of the [Kernel booting process](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html), that `Global Descriptor Table` must contain `NULL` descriptor as its first element. Unlike the `Global Descriptor Table`, the `Interrupt Descriptor Table` may contain a gate; it is not mandatory. For example, you may remember that we have loaded the Interrupt Descriptor table with the `NULL` gates only in the earlier [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) while transitioning into [protected mode](http://en.wikipedia.org/wiki/Protected_mode):
|
||||
in the `x86` architecture. Only [long mode](http://en.wikipedia.org/wiki/Long_mode) interrupt gates and trap gates can be referenced in the `x86_64`. Like the `Global Descriptor Table`, the `Interrupt Descriptor table` is an array of 8-byte gates on `x86` and an array of 16-byte gates on `x86_64`. We can remember from the second part of the [Kernel booting process](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html), that `Global Descriptor Table` must contain `NULL` descriptor as its first element. Unlike the `Global Descriptor Table`, the `Interrupt Descriptor Table` may contain a gate; it is not mandatory. For example, you may remember that we have loaded the Interrupt Descriptor table with the `NULL` gates only in the earlier [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) while transitioning into [protected mode](http://en.wikipedia.org/wiki/Protected_mode):
|
||||
|
||||
```C
|
||||
/*
|
||||
@ -284,7 +284,7 @@ The `PAGE_SIZE` is `4096`-bytes and the `THREAD_SIZE_ORDER` depends on the `KASA
|
||||
#endif
|
||||
```
|
||||
|
||||
`KASan` is a runtime memory [debugger](http://lwn.net/Articles/618180/). Thus, the `THREAD_SIZE` will be `16384` bytes if `CONFIG_KASAN` is disabled or `32768` if this kernel configuration option is enabled. These stacks contain useful data as long as a thread is alive or in a zombie state. While the thread is in user-space, the kernel stack is empty except for the `thread_info` structure (details about this structure are available in the fourth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process) at the bottom of the stack. The active or zombie threads aren't the only threads with their own stack. There also exist specialized stacks that are associated with each available CPU. These stacks are active when the kernel is executing on that CPU. When the user-space is executing on the CPU, these stacks do not contain any useful information. Each CPU has a few special per-cpu stacks as well. The first is the `interrupt stack` used for the external hardware interrupts. Its size is determined as follows:
|
||||
`KASan` is a runtime memory [debugger](http://lwn.net/Articles/618180/). Thus, the `THREAD_SIZE` will be `16384` bytes if `CONFIG_KASAN` is disabled or `32768` if this kernel configuration option is enabled. These stacks contain useful data as long as a thread is alive or in a zombie state. While the thread is in user-space, the kernel stack is empty except for the `thread_info` structure (details about this structure are available in the fourth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process) at the bottom of the stack. The active or zombie threads aren't the only threads with their own stack. There also exist specialized stacks that are associated with each available CPU. These stacks are active when the kernel is executing on that CPU. When the user-space is executing on the CPU, these stacks do not contain any useful information. Each CPU has a few special per-cpu stacks as well. The first is the `interrupt stack` used for the external hardware interrupts. Its size is determined as follows:
|
||||
|
||||
```C
|
||||
#define IRQ_STACK_ORDER (2 + KASAN_STACK_ORDER)
|
||||
@ -306,7 +306,7 @@ union irq_stack_union {
|
||||
|
||||
The first `irq_stack` field is a 16 kilobytes array. Also you can see that `irq_stack_union` contains a structure with the two fields:
|
||||
|
||||
* `gs_base` - The `gs` register always points to the bottom of the `irqstack` union. On the `x86_64`, the `gs` register is shared by per-cpu area and stack canary (more about `per-cpu` variables you can read in the special [part](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)). All per-cpu symbols are zero based and the `gs` points to the base of the per-cpu area. You already know that [segmented memory model](http://en.wikipedia.org/wiki/Memory_segmentation) is abolished in the long mode, but we can set the base address for the two segment registers - `fs` and `gs` with the [Model specific registers](http://en.wikipedia.org/wiki/Model-specific_register) and these registers can be still be used as address registers. If you remember the first [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of the Linux kernel initialization process, you can remember that we have set the `gs` register:
|
||||
* `gs_base` - The `gs` register always points to the bottom of the `irqstack` union. On `x86_64`, the `gs` register is shared by the per-cpu area and the stack canary (you can read more about `per-cpu` variables in the special [part](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html)). All per-cpu symbols are zero based and `gs` points to the base of the per-cpu area. You already know that the [segmented memory model](http://en.wikipedia.org/wiki/Memory_segmentation) is abolished in long mode, but we can set the base address for two segment registers, `fs` and `gs`, with the [Model specific registers](http://en.wikipedia.org/wiki/Model-specific_register), and these registers can still be used as address registers. If you remember the first [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of the Linux kernel initialization process, you may remember that we have set the `gs` register:
|
||||
|
||||
```assembly
|
||||
movl $MSR_GS_BASE,%ecx
|
||||
@ -488,4 +488,4 @@ Links
|
||||
* [segmented memory model](http://en.wikipedia.org/wiki/Memory_segmentation)
|
||||
* [Model specific registers](http://en.wikipedia.org/wiki/Model-specific_register)
|
||||
* [Stack canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries)
|
||||
* [Previous chapter](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html)
|
||||
* [Previous chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html)
|
@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 10.
|
||||
Last part
|
||||
-------------------------------------------------------------------------------
|
||||
|
||||
This is the tenth part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about interrupts and interrupt handling in the Linux kernel and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html) we saw a little about deferred interrupts and related concepts like `softirq`, `tasklet` and `workqeue`. In this part we will continue to dive into this theme and now it's time to look at real hardware driver.
|
||||
This is the tenth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) about interrupts and interrupt handling in the Linux kernel, and in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-9.html) we saw a little about deferred interrupts and related concepts like `softirq`, `tasklet` and `workqueue`. In this part we will continue to dive into this theme, and now it's time to look at a real hardware driver.
|
||||
|
||||
Let's consider the serial driver of the [StrongARM** SA-110/21285 Evaluation Board](http://netwinder.osuosl.org/pub/netwinder/docs/intel/datashts/27813501.pdf) as an example and look at how this driver requests an [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) line,
|
||||
what happens when an interrupt is triggered, and so on. The source code of this driver is located in the [drivers/tty/serial/21285.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/drivers/tty/serial/21285.c) source code file. Ok, we have the source code, let's start.
|
||||
@ -210,7 +210,7 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
|
||||
}
|
||||
```
|
||||
|
||||
We arelady saw the `irqaction` and the `irq_desc` structures in this chapter. The first structure represents per interrupt action descriptor and contains pointers to the interrupt handler, name of the device, interrupt number, etc. The second structure represents a descriptor of an interrupt and contains pointer to the `irqaction`, interrupt flags, etc. Note that the `request_threaded_irq` function called by the `request_irq` with the additioanal parameter: `irq_handler_t thread_fn`. If this parameter is not `NULL`, the `irq` thread will be created and the given `irq` handler will be executed in this thread. In the next step we need to make following checks:
|
||||
We already saw the `irqaction` and the `irq_desc` structures in this chapter. The first structure represents a per-interrupt action descriptor and contains pointers to the interrupt handler, the name of the device, the interrupt number, etc. The second structure represents a descriptor of an interrupt and contains a pointer to the `irqaction`, interrupt flags, etc. Note that the `request_threaded_irq` function is called by `request_irq` with an additional parameter: `irq_handler_t thread_fn`. If this parameter is not `NULL`, the `irq` thread will be created and the given `irq` handler will be executed in this thread. In the next step we need to make the following checks:
|
||||
|
||||
```C
|
||||
if (((irqflags & IRQF_SHARED) && !dev_id) ||
|
||||
@ -261,7 +261,7 @@ if (!action)
|
||||
return -ENOMEM;
|
||||
```
|
||||
|
||||
More about `kzalloc` will be in the separate chapter about [memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) in the Linux kernel. As we allocated space for the `irqaction`, we start to initialize this structure with the values of interrupt handler, interrupt flags, device name, etc:
|
||||
More about `kzalloc` will be covered in a separate chapter about [memory management](https://0xax.gitbooks.io/linux-insides/content/MM/index.html) in the Linux kernel. Once we have allocated space for the `irqaction`, we start to initialize this structure with the values of the interrupt handler, interrupt flags, device name, etc:
|
||||
|
||||
```C
|
||||
action->handler = handler;
|
||||
@ -301,7 +301,7 @@ And fill the rest of the given interrupt descriptor fields in the end. So, our `
|
||||
Prepare to handle an interrupt
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
In the previous paragraph we saw the requesting of the irq line for the given interrupt descriptor and registration of the `irqaction` structure for the given interrupt. We already know that when an interrupt event occurs, an interrupt controller notifies the processor about this event and processor tries to find appropriate interrupt gate for this interrupt. If you have read the eighth [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-8.html) of this chapter, you may remember the `native_init_IRQ` function. This function makes initialization of the local [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller). The following part of this function is the most interesting part for us right now:
|
||||
In the previous paragraph we saw how an irq line is requested for the given interrupt descriptor and how the `irqaction` structure is registered for the given interrupt. We already know that when an interrupt event occurs, an interrupt controller notifies the processor about this event and the processor tries to find the appropriate interrupt gate for this interrupt. If you have read the eighth [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-8.html) of this chapter, you may remember the `native_init_IRQ` function. This function initializes the local [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller). The following part of this function is the most interesting part for us right now:
|
||||
|
||||
```C
|
||||
for_each_clear_bit_from(i, used_vectors, first_system_vector) {
|
||||
@ -346,7 +346,7 @@ common_interrupt:
|
||||
interrupt do_IRQ
|
||||
```
|
||||
|
||||
The macro `interrupt` defined in the same source code file and saves [general purpose](https://en.wikipedia.org/wiki/Processor_register) registers on the stack, change the userspace `gs` on the kernel with the `SWAPGS` assembler instruction if need, increase [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) - `irq_count` variable that shows that we are in interrupt and call the `do_IRQ` function. This function defined in the [arch/x86/kernel/irq.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/irq.c) source code file and handles our device interrupt. Let's look at this function. The `do_IRQ` function takes one parameter - `pt_regs` structure that stores values of the userspace registers:
|
||||
The `interrupt` macro is defined in the same source code file. It saves the [general purpose](https://en.wikipedia.org/wiki/Processor_register) registers on the stack, switches the userspace `gs` to the kernel one with the `SWAPGS` assembler instruction if needed, increments the [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) `irq_count` variable, which shows that we are in an interrupt, and calls the `do_IRQ` function. This function is defined in the [arch/x86/kernel/irq.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/irq.c) source code file and handles our device interrupt. Let's look at this function. The `do_IRQ` function takes one parameter, a `pt_regs` structure that stores the values of the userspace registers:
|
||||
|
||||
```C
|
||||
__visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
|
||||
@ -413,7 +413,7 @@ We already know that when an `IRQ` finishes its work, deferred interrupts will b
|
||||
Exit from interrupt
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Ok, the interrupt handler finished its execution and now we must return from the interrupt. When the work of the `do_IRQ` function will be finsihed, we will return back to the assembler code in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry_entry_64.S) to the `ret_from_intr` label. First of all we disable interrupts with the `DISABLE_INTERRUPTS` macro that expands to the `cli` instruction and decreases value of the `irq_count` [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable. Remember, this variable had value - `1`, when we were in interrupt context:
|
||||
Ok, the interrupt handler has finished its execution and now we must return from the interrupt. When the `do_IRQ` function finishes its work, we return to the assembler code in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry_entry_64.S), to the `ret_from_intr` label. First of all we disable interrupts with the `DISABLE_INTERRUPTS` macro, which expands to the `cli` instruction, and decrease the value of the `irq_count` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variable. Remember, this variable had the value `1` when we were in interrupt context:
|
||||
|
||||
```assembly
|
||||
DISABLE_INTERRUPTS(CLBR_NONE)
|
||||
@ -448,7 +448,7 @@ That's all.
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
It is the end of the tenth part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter and as you have read in the beginning of this part - it is the last part of this chapter. This chapter started from the explanation of the theory of interrupts and we have learned what is it interrupt and kinds of interrupts, then we saw exceptions and handling of this kind of interrupts, deferred interrupts and finally we looked on the hardware interrupts and the handling of theirs in this part. Of course, this part and even this chapter does not cover full aspects of interrupts and interrupt handling in the Linux kernel. It is not realistic to do this. At least for me. It was the big part, I don't know how about you, but it was really big for me. This theme is much bigger than this chapter and I am not sure that somewhere there is a book that covers it. We have missed many part and aspects of interrupts and interrupt handling, but I think it will be good point to dive in the kernel code related to the interrupts and interrupts handling.
|
||||
This is the end of the tenth part of the [Interrupts and Interrupt Handling](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) chapter and, as you read at the beginning of this part, it is the last part of this chapter. The chapter started with an explanation of the theory of interrupts, where we learned what an interrupt is and which kinds of interrupts exist. Then we saw exceptions and how they are handled, deferred interrupts, and finally, in this part, hardware interrupts and their handling. Of course, this part and even this whole chapter do not cover all aspects of interrupts and interrupt handling in the Linux kernel. It is not realistic to do this, at least for me. It was a big part. I don't know about you, but it was really big for me. This topic is much bigger than this chapter and I am not sure that there is a book anywhere that covers it completely. We have missed many parts and aspects of interrupts and interrupt handling, but I think this chapter is a good starting point for diving into the kernel code related to interrupts and interrupt handling.
|
||||
|
||||
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
||||
|
||||
@ -464,13 +464,13 @@ Links
|
||||
* [initcall](http://kernelnewbies.org/Documents/InitcallMechanism)
|
||||
* [uart](https://en.wikipedia.org/wiki/Universal_asynchronous_receiver/transmitter)
|
||||
* [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture)
|
||||
* [memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)
|
||||
* [memory management](https://0xax.gitbooks.io/linux-insides/content/MM/index.html)
|
||||
* [i2c](https://en.wikipedia.org/wiki/I%C2%B2C)
|
||||
* [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
|
||||
* [GNU assembler](https://en.wikipedia.org/wiki/GNU_Assembler)
|
||||
* [Processor register](https://en.wikipedia.org/wiki/Processor_register)
|
||||
* [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
|
||||
* [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html)
|
||||
* [pid](https://en.wikipedia.org/wiki/Process_identifier)
|
||||
* [device tree](https://en.wikipedia.org/wiki/Device_tree)
|
||||
* [system calls](https://en.wikipedia.org/wiki/System_call)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-9.html)
|
@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 2.
|
||||
Start to dive into interrupt and exceptions handling in the Linux kernel
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
We saw some theory about interrupts and exception handling in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) and as I already wrote in that part, we will start to dive into interrupts and exceptions in the Linux kernel source code in this part. As you already can note, the previous part mostly described theoretical aspects and in this part we will start to dive directly into the Linux kernel source code. We will start to do it as we did it in other chapters, from the very early places. We will not see the Linux kernel source code from the earliest [code lines](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S#L292) as we saw it for example in the [Linux kernel booting process](http://0xax.gitbooks.io/linux-insides/content/Booting/index.html) chapter, but we will start from the earliest code which is related to the interrupts and exceptions. In this part we will try to go through the all interrupts and exceptions related stuff which we can find in the Linux kernel source code.
|
||||
We saw some theory about interrupt and exception handling in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html) and, as I already wrote in that part, we will start to dive into interrupts and exceptions in the Linux kernel source code in this part. As you may have noticed, the previous part mostly described theoretical aspects, and in this part we will dive directly into the Linux kernel source code. We will do it as we did in other chapters, starting from the very early places. We will not look at the Linux kernel source code from the earliest [code lines](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S#L292) as we did, for example, in the [Linux kernel booting process](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html) chapter, but we will start from the earliest code which is related to interrupts and exceptions. In this part we will try to go through all the interrupt and exception related code that we can find in the Linux kernel source code.
|
||||
|
||||
If you've read the previous parts, you may remember that the earliest place in the Linux kernel `x86_64` architecture-specific source code which is related to interrupts is located in the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/pm.c) source code file and represents the first setup of the [Interrupt Descriptor Table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table). It occurs right before the transition into the [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in the `go_to_protected_mode` function by the call of the `setup_idt`:
|
||||
|
||||
@ -38,7 +38,7 @@ struct gdt_ptr {
|
||||
|
||||
Of course in our case the `gdt_ptr` does not represent the `GDTR` register, but `IDTR`, since we set up the `Interrupt Descriptor Table`. You will not find an `idt_ptr` structure, because if it existed in the Linux kernel source code, it would be the same as `gdt_ptr` but with a different name. So, as you can understand, it makes no sense to have two similar structures which differ only by name. Note that we do not fill the `Interrupt Descriptor Table` with entries here, because it is too early to handle any interrupts or exceptions at this point. That's why we just fill the `IDT` with `NULL`.
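
To make the idea of reusing one structure for both registers more concrete, here is a minimal userspace sketch. It assumes the usual layout of such a descriptor-table pointer (a 16-bit limit followed by a 32-bit base); the names `desc_ptr32`, `null_idt` and `is_null_table` are hypothetical and are not taken from the kernel source. It does not execute `lidt`, it only checks the layout of the operand:

```C
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the setup code's `gdt_ptr`:
 * a 16-bit table limit followed by a 32-bit table base. */
struct desc_ptr32 {
	uint16_t len;
	uint32_t ptr;
} __attribute__((packed));

/* Without `packed` the compiler would insert padding after `len`. */
static_assert(sizeof(struct desc_ptr32) == 6, "operand must be 6 bytes");
static_assert(offsetof(struct desc_ptr32, ptr) == 2, "base must follow the limit");

/* An all-zero operand describes an empty table. */
static const struct desc_ptr32 null_idt = { 0, 0 };

/* Returns 1 when the operand has a zero limit and a zero base. */
static int is_null_table(const struct desc_ptr32 *p)
{
	return p->len == 0 && p->ptr == 0;
}
```

Since the layout is identical for the `GDTR` and the `IDTR`, a single structure type is enough for both; only the instruction that loads it differs.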
|
||||
|
||||
After the setup of the [Interrupt descriptor table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table), [Global Descriptor Table](http://en.wikipedia.org/wiki/GDT) and other stuff we jump into [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in the - [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/pmjump.S). You can read more about it in the [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) which describes the transition to protected mode.
|
||||
After the setup of the [Interrupt descriptor table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table), [Global Descriptor Table](http://en.wikipedia.org/wiki/GDT) and other stuff we jump into [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/pmjump.S). You can read more about it in the [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) which describes the transition to protected mode.
|
||||
|
||||
We already know from the earliest parts that entry to protected mode is located in the `boot_params.hdr.code32_start` and you can see that we pass the entry of the protected mode and `boot_params` to the `protected_mode_jump` in the end of the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/pm.c):
|
||||
|
||||
@ -100,7 +100,7 @@ else
|
||||
endif
|
||||
```
|
||||
|
||||
Now as we jumped on the `startup_32` from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S) we will not find anything related to the interrupt handling here. The `startup_32` contains code that makes preparations before the transition into [long mode](http://en.wikipedia.org/wiki/Long_mode) and directly jumps in to it. The `long mode` entry is located in `startup_64` and it makes preparations before the [kernel decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) that occurs in the `decompress_kernel` from the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c). After the kernel is decompressed, we jump on the `startup_64` from the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head_64.S). In the `startup_64` we start to build identity-mapped pages. After we have built identity-mapped pages, checked the [NX](http://en.wikipedia.org/wiki/NX_bit) bit, setup the `Extended Feature Enable Register` (see in links), and updated the early `Global Descriptor Table` with the `lgdt` instruction, we need to setup `gs` register with the following code:
|
||||
Now that we have jumped to the `startup_32` from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S), we will not find anything related to interrupt handling here. The `startup_32` contains code that makes preparations before the transition into [long mode](http://en.wikipedia.org/wiki/Long_mode) and jumps directly into it. The `long mode` entry is located in `startup_64` and it makes preparations before the [kernel decompression](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) that occurs in the `decompress_kernel` from the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c). After the kernel is decompressed, we jump to the `startup_64` from the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head_64.S). In the `startup_64` we start to build identity-mapped pages. After we have built identity-mapped pages, checked the [NX](http://en.wikipedia.org/wiki/NX_bit) bit, set up the `Extended Feature Enable Register` (see in links), and updated the early `Global Descriptor Table` with the `lgdt` instruction, we need to set up the `gs` register with the following code:
|
||||
|
||||
```assembly
|
||||
movl $MSR_GS_BASE,%ecx
|
||||
@ -109,7 +109,7 @@ movl initial_gs+4(%rip),%edx
|
||||
wrmsr
|
||||
```
|
||||
|
||||
We already saw this code in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html). First of all pay attention on the last `wrmsr` instruction. This instruction writes data from the `edx:eax` registers to the [model specific register](http://en.wikipedia.org/wiki/Model-specific_register) specified by the `ecx` register. We can see that `ecx` contains `$MSR_GS_BASE` which is declared in the [arch/x86/include/uapi/asm/msr-index.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/uapi/asm/msr-index.h) and looks like:
|
||||
We already saw this code in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html). First of all, pay attention to the last `wrmsr` instruction. This instruction writes data from the `edx:eax` registers to the [model specific register](http://en.wikipedia.org/wiki/Model-specific_register) specified by the `ecx` register. We can see that `ecx` contains `$MSR_GS_BASE`, which is declared in the [arch/x86/include/uapi/asm/msr-index.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/uapi/asm/msr-index.h) and looks like:
|
||||
|
||||
```C
|
||||
#define MSR_GS_BASE 0xc0000101
|
||||
@ -183,7 +183,7 @@ movl initial_gs+4(%rip),%edx
|
||||
wrmsr
|
||||
```
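
As a rough illustration of how a 64-bit value is handed to `wrmsr`, the sketch below models the three registers the instruction reads. The structure `wrmsr_args` and the helpers `split_for_wrmsr` and `join_edx_eax` are hypothetical names made up for this example, and the code only shows the split into halves; it does not execute the privileged instruction:

```C
#include <assert.h>
#include <stdint.h>

#define MSR_GS_BASE 0xc0000101u	/* MSR index quoted in the text */

/* Hypothetical model of the registers that `wrmsr` consumes. */
struct wrmsr_args {
	uint32_t ecx;	/* index of the model specific register */
	uint32_t eax;	/* low 32 bits of the value */
	uint32_t edx;	/* high 32 bits of the value */
};

/* Split a 64-bit value the way the two `movl` instructions do:
 * the low half goes to eax, the half at offset +4 goes to edx. */
static struct wrmsr_args split_for_wrmsr(uint32_t msr, uint64_t value)
{
	struct wrmsr_args a = { msr, (uint32_t)value, (uint32_t)(value >> 32) };
	return a;
}

/* Rebuild the 64-bit value that the processor sees in edx:eax. */
static uint64_t join_edx_eax(const struct wrmsr_args *a)
{
	return ((uint64_t)a->edx << 32) | a->eax;
}
```

Because `x86_64` is little-endian, the high half of a 64-bit variable lives four bytes after its start, which is why the assembly loads `edx` from `initial_gs+4`.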
|
||||
|
||||
Here we specified a model specific register with `MSR_GS_BASE`, put the 64-bit address of the `initial_gs` to the `edx:eax` pair and execute the `wrmsr` instruction for filling the `gs` register with the base address of the `init_per_cpu__irq_stack_union` which will be at the bottom of the interrupt stack. After this we will jump to the C code on the `x86_64_start_kernel` from the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head64.c). In the `x86_64_start_kernel` function we do the last preparations before we jump into the generic and architecture-independent kernel code and one of these preparations is filling the early `Interrupt Descriptor Table` with the interrupts handlers entries or `early_idt_handlers`. You can remember it, if you have read the part about the [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) and can remember following code:
|
||||
Here we specified a model specific register with `MSR_GS_BASE`, put the 64-bit address of the `initial_gs` into the `edx:eax` pair and executed the `wrmsr` instruction to fill the `gs` register with the base address of the `init_per_cpu__irq_stack_union`, which will be at the bottom of the interrupt stack. After this we jump to the C code at `x86_64_start_kernel` from the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head64.c). In the `x86_64_start_kernel` function we do the last preparations before we jump into the generic and architecture-independent kernel code, and one of these preparations is filling the early `Interrupt Descriptor Table` with the interrupt handler entries, or `early_idt_handlers`. You may remember this if you have read the part about the [Early interrupt and exception handling](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html), along with the following code:
|
||||
|
||||
```C
|
||||
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
|
||||
@ -224,12 +224,12 @@ ENTRY(early_idt_handler_array)
|
||||
ENDPROC(early_idt_handler_common)
|
||||
```
|
||||
|
||||
It fills `early_idt_handler_arry` with the `.rept NUM_EXCEPTION_VECTORS` and contains entry of the `early_make_pgtable` interrupt handler (more about its implementation you can read in the part about [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html)). For now we come to the end of the `x86_64` architecture-specific code and the next part is the generic kernel code. Of course you already can know that we will return to the architecture-specific code in the `setup_arch` function and other places, but this is the end of the `x86_64` early code.
|
||||
It fills `early_idt_handler_array` using `.rept NUM_EXCEPTION_VECTORS` and contains the entry of the `early_make_pgtable` interrupt handler (you can read more about its implementation in the part about [Early interrupt and exception handling](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html)). For now we have come to the end of the `x86_64` architecture-specific code and the next part is the generic kernel code. Of course, you may already know that we will return to the architecture-specific code in the `setup_arch` function and other places, but this is the end of the `x86_64` early code.
|
||||
|
||||
Setting stack canary for the interrupt stack
|
||||
-------------------------------------------------------------------------------
|
||||
|
||||
The next stop after the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head_64.S) is the biggest `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c). If you've read the previous [chapter](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) about the Linux kernel initialization process, you must remember it. This function does all initialization stuff before kernel will launch first `init` process with the [pid](https://en.wikipedia.org/wiki/Process_identifier) - `1`. The first thing that is related to the interrupts and exceptions handling is the call of the `boot_init_stack_canary` function.
|
||||
The next stop after the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head_64.S) is the big `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c). If you've read the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) about the Linux kernel initialization process, you must remember it. This function does all the initialization before the kernel launches the first `init` process with the [pid](https://en.wikipedia.org/wiki/Process_identifier) `1`. The first thing that is related to interrupt and exception handling is the call of the `boot_init_stack_canary` function.
|
||||
|
||||
This function sets the [canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries) value to protect against interrupt stack overflow. We already saw some details about the implementation of the `boot_init_stack_canary` in the previous part, and now let's take a closer look at it. You can find the implementation of this function in the [arch/x86/include/asm/stackprotector.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/stackprotector.h) and it depends on the `CONFIG_CC_STACKPROTECTOR` kernel configuration option. If this option is not set, this function will not do anything:
|
||||
|
||||
@ -245,7 +245,7 @@ static inline void boot_init_stack_canary(void)
|
||||
#endif
|
||||
```
|
||||
|
||||
If the `CONFIG_CC_STACKPROTECTOR` kernel configuration option is set, the `boot_init_stack_canary` function starts from the check stat `irq_stack_union` that represents [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) interrupt stack has offset equal to forty bytes from the `stack_canary` value:
|
||||
If the `CONFIG_CC_STACKPROTECTOR` kernel configuration option is set, the `boot_init_stack_canary` function starts with a check that the `stack_canary` field is located at an offset of forty bytes within the `irq_stack_union`, which represents the [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) interrupt stack:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_X86_64
|
||||
@ -253,7 +253,7 @@ If the `CONFIG_CC_STACKPROTECTOR` kernel configuration option is set, the `boot_
|
||||
#endif
|
||||
```
|
||||
|
||||
As we can read in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) the `irq_stack_union` represented by the following union:
|
||||
As we read in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html), the `irq_stack_union` is represented by the following union:
|
||||
|
||||
```C
|
||||
union irq_stack_union {
|
||||
@ -266,7 +266,7 @@ union irq_stack_union {
|
||||
};
|
||||
```
|
||||
|
||||
which defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/processor.h). We know that [union](http://en.wikipedia.org/wiki/Union_type) in the [C](http://en.wikipedia.org/wiki/C_%28programming_language%29) programming language is a data structure which stores only one field in a memory. We can see here that structure has first field - `gs_base` which is 40 bytes size and represents bottom of the `irq_stack`. So, after this our check with the `BUILD_BUG_ON` macro should end successfully. (you can read the first part about Linux kernel initialization [process](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) if you're interesting about the `BUILD_BUG_ON` macro).
|
||||
which is defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/processor.h). We know that a [union](http://en.wikipedia.org/wiki/Union_type) in the [C](http://en.wikipedia.org/wiki/C_%28programming_language%29) programming language is a data structure whose members share the same memory, so it holds only one of them at a time. We can see here that the structure has the first field, `gs_base`, which is 40 bytes in size and represents the bottom of the `irq_stack`. So the `stack_canary` field that follows it lies at offset 40 and our check with the `BUILD_BUG_ON` macro should end successfully (you can read the first part about the Linux kernel initialization [process](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) if you're interested in the `BUILD_BUG_ON` macro).
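
The offset check can be reproduced in userspace with a small sketch. The union below only mirrors the layout described in the text (a 40-byte `gs_base` followed by `stack_canary`, overlaid on the stack array). The names `demo_irq_stack_union`, `DEMO_IRQ_STACK_SIZE` and `canary_offset` are hypothetical, the stack size is an assumed value chosen for illustration, and C11 `static_assert` stands in for the kernel's `BUILD_BUG_ON`:

```C
#include <assert.h>
#include <stddef.h>

#define DEMO_IRQ_STACK_SIZE 16384	/* assumed size, for illustration only */

/* Both members start at the same address, so `gs_base` overlays
 * the first 40 bytes (the bottom) of `irq_stack`. */
union demo_irq_stack_union {
	char irq_stack[DEMO_IRQ_STACK_SIZE];
	struct {
		char gs_base[40];
		unsigned long stack_canary;
	};
};

/* Compile-time checks in the spirit of BUILD_BUG_ON: the build
 * fails if the canary is not at offset 40. */
static_assert(offsetof(union demo_irq_stack_union, stack_canary) == 40,
	      "stack_canary must be at offset 40");
static_assert(sizeof(union demo_irq_stack_union) == DEMO_IRQ_STACK_SIZE,
	      "union must be exactly as large as the stack");

/* Runtime view of the same offset. */
static size_t canary_offset(void)
{
	return offsetof(union demo_irq_stack_union, stack_canary);
}
```

The fixed offset matters because code built with the stack protector expects to find the canary at a known distance from the `gs` base, so a layout change has to be caught at build time rather than at run time.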
|
||||
|
||||
After this we calculate new `canary` value based on the random number and [Time Stamp Counter](http://en.wikipedia.org/wiki/Time_Stamp_Counter):
|
||||
|
||||
@ -402,7 +402,7 @@ WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
|
||||
Early trap initialization during kernel initialization
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The next functions after the `local_disable_irq` are `boot_cpu_init` and `page_address_init`, but they are not related to the interrupts and exceptions (more about this functions you can read in the chapter about Linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html)). The next is the `setup_arch` function. As you can remember this function located in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel.setup.c) source code file and makes initialization of many different architecture-dependent [stuff](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html). The first interrupts related function which we can see in the `setup_arch` is the - `early_trap_init` function. This function defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/traps.c) and fills `Interrupt Descriptor Table` with the couple of entries:
|
||||
The next functions after the `local_irq_disable` are `boot_cpu_init` and `page_address_init`, but they are not related to interrupts and exceptions (you can read more about these functions in the chapter about the Linux kernel [initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html)). The next is the `setup_arch` function. As you may remember, this function is located in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel.setup.c) source code file and initializes many different architecture-dependent [things](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html). The first interrupt related function which we can see in the `setup_arch` is the `early_trap_init` function. This function is defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/traps.c) and fills the `Interrupt Descriptor Table` with a couple of entries:
|
||||
|
||||
```C
|
||||
void __init early_trap_init(void)
|
||||
@ -432,7 +432,7 @@ static inline void set_intr_gate_ist(int n, void *addr, unsigned ist)
|
||||
}
|
||||
```
|
||||
|
||||
First of all we can see the check that `n` which is [vector number](http://en.wikipedia.org/wiki/Interrupt_vector_table) of the interrupt is not greater than `0xff` or 255. We need to check it because we remember from the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) that vector number of an interrupt must be between `0` and `255`. In the next step we can see the call of the `_set_gate` function that sets a given interrupt gate to the `IDT` table:
|
||||
First of all we can see the check that `n`, which is the [vector number](http://en.wikipedia.org/wiki/Interrupt_vector_table) of the interrupt, is not greater than `0xff` or 255. We need to check it because we remember from the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html) that the vector number of an interrupt must be between `0` and `255`. In the next step we can see the call of the `_set_gate` function that sets a given interrupt gate in the `IDT` table:
|
||||
|
||||
```C
|
||||
static inline void _set_gate(int gate, unsigned type, void *addr,
|
||||
@ -544,4 +544,4 @@ Links
|
||||
* [vector number](http://en.wikipedia.org/wiki/Interrupt_vector_table)
|
||||
* [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks)
|
||||
* [Privilege level](http://en.wikipedia.org/wiki/Privilege_level)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html)
|
@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 3.
|
||||
Exception Handling
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the third part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about interrupt and exception handling in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we stopped at the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blame/master/arch/x86/kernel/setup.c) source code file.
|
||||
This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) about interrupt and exception handling in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) we stopped at the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blame/master/arch/x86/kernel/setup.c) source code file.
|
||||
|
||||
We already know that this function initializes architecture-specific things. In our case the `setup_arch` function performs [x86_64](https://en.wikipedia.org/wiki/X86-64)-related initialization. `setup_arch` is a big function, and in the previous part we stopped at the setup of the exception handlers for the following two exceptions:
|
||||
|
||||
@ -516,7 +516,7 @@ Links
|
||||
* [system call](http://en.wikipedia.org/wiki/System_call)
|
||||
* [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html)
|
||||
* [SIGTRAP](https://en.wikipedia.org/wiki/Unix_signal#SIGTRAP)
|
||||
* [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
|
||||
* [Per-CPU variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html)
|
||||
* [kgdb](https://en.wikipedia.org/wiki/KGDB)
|
||||
* [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html)
|
@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 4.
|
||||
Initialization of non-early interrupt gates
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the fourth part about interrupt and exception handling in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) we saw the first early `#DB` and `#BP` exception handlers from [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We stopped right after the `early_trap_init` function, which is called in the `setup_arch` function defined in [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/setup.c). In this part we will continue to dive into interrupt and exception handling in the Linux kernel for `x86_64`, starting from the place where we left off in the last part. The first thing related to interrupt and exception handling is the setup of the `#PF` or [page fault](https://en.wikipedia.org/wiki/Page_fault) handler with the `early_trap_pf_init` function. Let's start from it.
|
||||
This is the fourth part about interrupt and exception handling in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html) we saw the first early `#DB` and `#BP` exception handlers from [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We stopped right after the `early_trap_init` function, which is called in the `setup_arch` function defined in [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/setup.c). In this part we will continue to dive into interrupt and exception handling in the Linux kernel for `x86_64`, starting from the place where we left off in the last part. The first thing related to interrupt and exception handling is the setup of the `#PF` or [page fault](https://en.wikipedia.org/wiki/Page_fault) handler with the `early_trap_pf_init` function. Let's start from it.
|
||||
|
||||
Early page fault handler
|
||||
--------------------------------------------------------------------------------
|
||||
@ -20,7 +20,7 @@ void __init early_trap_pf_init(void)
|
||||
}
|
||||
```
|
||||
|
||||
This macro is defined in [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/desc.h). We already saw macros like this in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html): `set_system_intr_gate` and `set_intr_gate_ist`. This macro checks that the given vector number is not greater than `255` (the maximum vector number) and calls the `_set_gate` function, as `set_system_intr_gate` and `set_intr_gate_ist` did:
|
||||
This macro is defined in [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/desc.h). We already saw macros like this in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html): `set_system_intr_gate` and `set_intr_gate_ist`. This macro checks that the given vector number is not greater than `255` (the maximum vector number) and calls the `_set_gate` function, as `set_system_intr_gate` and `set_intr_gate_ist` did:
|
||||
|
||||
```C
|
||||
#define set_intr_gate(n, addr) \
|
||||
@ -64,7 +64,7 @@ When the `early_trap_pf_init` will be called, the `set_intr_gate` will be expand
|
||||
trace_idtentry page_fault do_page_fault has_error_code=1
|
||||
```
|
||||
|
||||
We saw in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) how the `#DB` and `#BP` handlers are defined. They were defined with the `idtentry` macro, but here we can see `trace_idtentry`. This macro is defined in the same source code file and depends on the `CONFIG_TRACING` kernel configuration option:
|
||||
We saw in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html) how the `#DB` and `#BP` handlers are defined. They were defined with the `idtentry` macro, but here we can see `trace_idtentry`. This macro is defined in the same source code file and depends on the `CONFIG_TRACING` kernel configuration option:
|
||||
|
||||
```assembly
|
||||
#ifdef CONFIG_TRACING
|
||||
@ -79,7 +79,7 @@ idtentry \sym \do_sym has_error_code=\has_error_code
|
||||
#endif
|
||||
```
|
||||
|
||||
We will not dive into exception [Tracing](https://en.wikipedia.org/wiki/Tracing_%28software%29) now. If `CONFIG_TRACING` is not set, we can see that the `trace_idtentry` macro just expands to the normal `idtentry`. We already saw the implementation of the `idtentry` macro in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html), so let's start from the `page_fault` exception handler.
|
||||
We will not dive into exception [Tracing](https://en.wikipedia.org/wiki/Tracing_%28software%29) now. If `CONFIG_TRACING` is not set, we can see that the `trace_idtentry` macro just expands to the normal `idtentry`. We already saw the implementation of the `idtentry` macro in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html), so let's start from the `page_fault` exception handler.
|
||||
|
||||
As we can see in the `idtentry` definition, the handler of the `page_fault` is the `do_page_fault` function, which is defined in [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/mm/fault.c) and, like all exception handlers, takes two arguments:
|
||||
|
||||
@ -197,7 +197,7 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
|
||||
}
|
||||
```
|
||||
|
||||
Here we can see the `proc_root_readdir` function, which is called when the Linux [VFS](https://en.wikipedia.org/wiki/Virtual_file_system) needs to read the `root` directory contents. If a condition is marked with `unlikely`, the compiler can put the code for the `false` case right after the branch. Now let's get back to our address check. Comparing the given address with `0x00007ffffffff000` tells us whether the faulting address belongs to kernel space or user space. After this check the `__do_page_fault` routine tries to determine the problem that provoked the page fault exception and then passes the address to the appropriate routine. It can be a `kmemcheck` fault, a spurious fault, a [kprobes](https://www.kernel.org/doc/Documentation/kprobes.txt) fault, etc. We will not dive into the implementation details of the page fault exception handler in this part, because we would need to know many different concepts provided by the Linux kernel, but we will see it in the chapter about [memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) in the Linux kernel.
|
||||
Here we can see the `proc_root_readdir` function, which is called when the Linux [VFS](https://en.wikipedia.org/wiki/Virtual_file_system) needs to read the `root` directory contents. If a condition is marked with `unlikely`, the compiler can put the code for the `false` case right after the branch. Now let's get back to our address check. Comparing the given address with `0x00007ffffffff000` tells us whether the faulting address belongs to kernel space or user space. After this check the `__do_page_fault` routine tries to determine the problem that provoked the page fault exception and then passes the address to the appropriate routine. It can be a `kmemcheck` fault, a spurious fault, a [kprobes](https://www.kernel.org/doc/Documentation/kprobes.txt) fault, etc. We will not dive into the implementation details of the page fault exception handler in this part, because we would need to know many different concepts provided by the Linux kernel, but we will see it in the chapter about [memory management](https://0xax.gitbooks.io/linux-insides/content/MM/index.html) in the Linux kernel.
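
The two ideas in the paragraph above, the `unlikely` hint and the address comparison, can be sketched in user space. This is a minimal sketch and not the kernel's code: `USER_SPACE_LIMIT` and `address_in_kernel_space` are hypothetical names chosen for illustration, the limit is the `0x00007ffffffff000` value mentioned above, and `__builtin_expect` is a GCC/Clang builtin, so the sketch assumes one of these compilers:

```C
#include <stdbool.h>
#include <stdint.h>

/* Hint to the compiler that the condition is expected to be false.
 * Relies on the GCC/Clang builtin __builtin_expect. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical name for the boundary value mentioned in the text. */
#define USER_SPACE_LIMIT 0x00007ffffffff000ULL

/* Returns true if the address lies at or above the limit,
 * i.e. outside the user-space range. */
static bool address_in_kernel_space(uint64_t address)
{
	if (unlikely(address >= USER_SPACE_LIMIT))
		return true;
	return false;
}
```

The hint does not change the result of the comparison; it only tells the compiler which branch to lay out as the common path.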
|
||||
|
||||
Back to start_kernel
|
||||
--------------------------------------------------------------------------------
|
||||
@ -214,7 +214,7 @@ There are many different function calls after the `early_trap_pf_init` in the `s
|
||||
#endif
|
||||
```
|
||||
|
||||
Note that it depends on the `CONFIG_EISA` kernel configuration parameter, which represents `EISA` support. Here we use the `early_ioremap` function to map `I/O` memory into the page tables. We use the `readl` function to read the first `4` bytes from the mapped region, and if they are equal to the `EISA` string we set `EISA_bus` to one. In the end we just unmap the previously mapped region. You can read more about `early_ioremap` in the part that describes [Fix-Mapped Addresses and ioremap](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html).
|
||||
Note that it depends on the `CONFIG_EISA` kernel configuration parameter, which represents `EISA` support. Here we use the `early_ioremap` function to map `I/O` memory into the page tables. We use the `readl` function to read the first `4` bytes from the mapped region, and if they are equal to the `EISA` string we set `EISA_bus` to one. In the end we just unmap the previously mapped region. You can read more about `early_ioremap` in the part that describes [Fix-Mapped Addresses and ioremap](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-2.html).
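
The signature check described above can be modelled in user space. This is only a sketch under stated assumptions: `read_le32` stands in for `readl` on a little-endian machine, the region is an ordinary byte buffer rather than memory mapped with `early_ioremap`, and `EISA_SIGNATURE` and `has_eisa_signature` are hypothetical names introduced for illustration:

```C
#include <stdint.h>

/* The four ASCII bytes 'E', 'I', 'S', 'A' packed as one
 * little-endian 32-bit value (hypothetical name). */
#define EISA_SIGNATURE ((uint32_t)'E' | ((uint32_t)'I' << 8) | \
			((uint32_t)'S' << 16) | ((uint32_t)'A' << 24))

/* Stand-in for readl(): assemble a 32-bit little-endian value
 * from the first four bytes of the region. */
static uint32_t read_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 if the region starts with the "EISA" signature, else 0. */
static int has_eisa_signature(const unsigned char *region)
{
	return read_le32(region) == EISA_SIGNATURE;
}
```

A buffer that starts with the bytes `E`, `I`, `S`, `A` makes `has_eisa_signature` return `1`; any other first four bytes give `0`.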
|
||||
|
||||
After this we start to fill the `Interrupt Descriptor Table` with the different interrupt gates. First of all we set `#DE` or `Divide Error` and `#NMI` or `Non-maskable Interrupt`:
|
||||
|
||||
@ -223,7 +223,7 @@ set_intr_gate(X86_TRAP_DE, divide_error);
|
||||
set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
|
||||
```
|
||||
|
||||
We use the `set_intr_gate` macro to set the interrupt gate for the `#DE` exception and `set_intr_gate_ist` for the `#NMI`. You may remember that we already used these macros when we set the interrupt gates for the page fault handler, the debug handler, etc.; you can find an explanation of them in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html). After this we set up exception gates for the following exceptions:
|
||||
We use the `set_intr_gate` macro to set the interrupt gate for the `#DE` exception and `set_intr_gate_ist` for the `#NMI`. You may remember that we already used these macros when we set the interrupt gates for the page fault handler, the debug handler, etc.; you can find an explanation of them in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html). After this we set up exception gates for the following exceptions:
|
||||
|
||||
```C
|
||||
set_system_intr_gate(X86_TRAP_OF, &overflow);
|
||||
@ -300,7 +300,7 @@ In the next step we fill the `used_vectors` array which defined in the [arch/x86
|
||||
DECLARE_BITMAP(used_vectors, NR_VECTORS);
|
||||
```
|
||||
|
||||
of the first `32` interrupts (more about bitmaps in the Linux kernel you can read in the part which describes [cpumasks and bitmaps](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html))
|
||||
of the first `32` interrupts (more about bitmaps in the Linux kernel you can read in the part which describes [cpumasks and bitmaps](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html))
|
||||
|
||||
```C
|
||||
for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
|
||||
@ -329,7 +329,7 @@ __set_fixmap(FIX_RO_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO);
|
||||
idt_descr.address = fix_to_virt(FIX_RO_IDT);
|
||||
```
|
||||
|
||||
and write its address to `idt_descr.address` (you can read more about fix-mapped addresses in the second part of the [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) chapter). After this we can see the call of the `cpu_init` function, which is defined in [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/cpu/common.c). This function initializes all `per-cpu` state. At the beginning of `cpu_init` we do the following: first we wait until the current cpu is initialized, and then we call the `cr4_init_shadow` function, which stores a shadow copy of the `cr4` control register for the current cpu, and load the CPU microcode if needed, with the following function calls:
|
||||
and write its address to `idt_descr.address` (you can read more about fix-mapped addresses in the second part of the [Linux kernel memory management](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-2.html) chapter). After this we can see the call of the `cpu_init` function, which is defined in [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/cpu/common.c). This function initializes all `per-cpu` state. At the beginning of `cpu_init` we do the following: first we wait until the current cpu is initialized, and then we call the `cr4_init_shadow` function, which stores a shadow copy of the `cr4` control register for the current cpu, and load the CPU microcode if needed, with the following function calls:
|
||||
|
||||
```C
|
||||
wait_for_master_cpu(cpu);
|
||||
@ -421,7 +421,7 @@ set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
|
||||
#endif
|
||||
```
|
||||
|
||||
Here we copy `idt_table` to the `nmi_idt_table` and set up exception handlers for the `#DB` or `Debug exception` and the `#BP` or `Breakpoint exception`. You may remember that we already set these interrupt gates in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html), so why do we need to set them up again? We set them up again because when we initialized them before in the `early_trap_init` function, the `Task State Segment` was not ready yet, but now it is ready after the call of the `cpu_init` function.
|
||||
Here we copy `idt_table` to the `nmi_idt_table` and set up exception handlers for the `#DB` or `Debug exception` and the `#BP` or `Breakpoint exception`. You may remember that we already set these interrupt gates in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html), so why do we need to set them up again? We set them up again because when we initialized them before in the `early_trap_init` function, the `Task State Segment` was not ready yet, but now it is ready after the call of the `cpu_init` function.
|
||||
|
||||
That's all. Soon we will consider all handlers of these interrupts/exceptions.
|
||||
|
||||
@ -448,8 +448,8 @@ Links
|
||||
* [3DNow](https://en.wikipedia.org/?title=3DNow!)
|
||||
* [CPU caches](https://en.wikipedia.org/wiki/CPU_cache)
|
||||
* [VFS](https://en.wikipedia.org/wiki/Virtual_file_system)
|
||||
* [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)
|
||||
* [Fix-Mapped Addresses and ioremap](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)
|
||||
* [Linux kernel memory management](https://0xax.gitbooks.io/linux-insides/content/MM/index.html)
|
||||
* [Fix-Mapped Addresses and ioremap](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-2.html)
|
||||
* [Extended Industry Standard Architecture](https://en.wikipedia.org/wiki/Extended_Industry_Standard_Architecture)
|
||||
* [INT instruction](https://en.wikipedia.org/wiki/INT_%28x86_instruction%29)
|
||||
* [INTO](http://x86.renejeschke.de/html/file_module_x86_id_142.html)
|
||||
@ -459,7 +459,7 @@ Links
|
||||
* [x87 FPU](https://en.wikipedia.org/wiki/X86_instruction_listings#x87_floating-point_instructions)
|
||||
* [MCE exception](https://en.wikipedia.org/wiki/Machine-check_exception)
|
||||
* [SIMD](https://en.wikipedia.org/?title=SIMD)
|
||||
* [cpumasks and bitmaps](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
|
||||
* [cpumasks and bitmaps](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html)
|
||||
* [NX](https://en.wikipedia.org/wiki/NX_bit)
|
||||
* [Task State Segment](https://en.wikipedia.org/wiki/Task_state_segment)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html)
|
@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 5.
|
||||
Implementation of exception handlers
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the fifth part about interrupt and exception handling in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-4.html) we stopped at the setting of interrupt gates in the [Interrupt descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table). We did it in the `trap_init` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c) source code file. We saw only the setting of these interrupt gates in the previous part, and in the current part we will see the implementation of the exception handlers for these gates. The preparation before an exception handler is executed is in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/entry_64.S) assembly file and occurs in the [idtentry](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/entry_64.S#L820) macro that defines exception entry points:
|
||||
This is the fifth part about interrupt and exception handling in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-4.html) we stopped at the setting of interrupt gates in the [Interrupt descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table). We did it in the `trap_init` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c) source code file. We saw only the setting of these interrupt gates in the previous part, and in the current part we will see the implementation of the exception handlers for these gates. The preparation before an exception handler is executed is in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/entry_64.S) assembly file and occurs in the [idtentry](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/entry_64.S#L820) macro that defines exception entry points:
|
||||
|
||||
```assembly
|
||||
idtentry divide_error do_divide_error has_error_code=0
|
||||
@ -62,7 +62,7 @@ native_irq_return_iret:
|
||||
iretq
|
||||
```
|
||||
|
||||
You can read more about the `idtentry` macro in the third part of the [http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) chapter. Now that we have seen the preparation before an exception handler is executed, it is time to look at the handlers. First of all let's look at the following handlers:
|
||||
You can read more about the `idtentry` macro in the third part of the [https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html) chapter. Now that we have seen the preparation before an exception handler is executed, it is time to look at the handlers. First of all let's look at the following handlers:
|
||||
|
||||
* divide_error
|
||||
* overflow
|
||||
@ -211,7 +211,7 @@ static inline void conditional_sti(struct pt_regs *regs)
|
||||
}
|
||||
```
|
||||
|
||||
You can read more about the `local_irq_enable` macro in the second [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-2.html) of this chapter. The next and last call in `do_error_trap` is the `do_trap` function. First of all the `do_trap` function defines the `tsk` variable, which has the `task_struct` type and represents the currently interrupted process. After the definition of `tsk`, we can see the call of the `do_trap_no_signal` function:
|
||||
You can read more about the `local_irq_enable` macro in the second [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-2.html) of this chapter. The next and last call in `do_error_trap` is the `do_trap` function. First of all the `do_trap` function defines the `tsk` variable, which has the `task_struct` type and represents the currently interrupted process. After the definition of `tsk`, we can see the call of the `do_trap_no_signal` function:
|
||||
|
||||
```C
|
||||
struct task_struct *tsk = current;
|
||||
@ -463,7 +463,7 @@ That's all.
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
It is the end of the fifth part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter, and we saw the implementation of some interrupt handlers in this part. In the next part we will continue to dive into interrupt and exception handlers and will see the handler for the [Non-Maskable Interrupts](https://en.wikipedia.org/wiki/Non-maskable_interrupt), handling of the math [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) and [SIMD](https://en.wikipedia.org/wiki/SIMD) coprocessor exceptions, and much more.
|
||||
It is the end of the fifth part of the [Interrupts and Interrupt Handling](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) chapter, and we saw the implementation of some interrupt handlers in this part. In the next part we will continue to dive into interrupt and exception handlers and will see the handler for the [Non-Maskable Interrupts](https://en.wikipedia.org/wiki/Non-maskable_interrupt), handling of the math [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) and [SIMD](https://en.wikipedia.org/wiki/SIMD) coprocessor exceptions, and much more.
|
||||
|
||||
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
||||
|
||||
@ -490,4 +490,4 @@ Links
|
||||
* [x87 FPU](https://en.wikipedia.org/wiki/X87)
|
||||
* [control register](https://en.wikipedia.org/wiki/Control_register)
|
||||
* [MMX](https://en.wikipedia.org/wiki/MMX_%28instruction_set%29)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-4.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-4.html)
|
@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 6.
|
||||
Non-maskable interrupt handler
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
It is the sixth part of the [Interrupts and Interrupt Handling in the Linux kernel](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-5.html) we saw the implementation of some exception handlers: for the [General Protection Fault](https://en.wikipedia.org/wiki/General_protection_fault) exception, the divide exception, invalid [opcode](https://en.wikipedia.org/wiki/Opcode) exceptions, etc. As I wrote in the previous part, we will see the implementation of the remaining exceptions in this part. We will see the implementation of the following handlers:
|
||||
It is the sixth part of the [Interrupts and Interrupt Handling in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) chapter. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-5.html) we saw the implementation of some exception handlers: for the [General Protection Fault](https://en.wikipedia.org/wiki/General_protection_fault) exception, the divide exception, invalid [opcode](https://en.wikipedia.org/wiki/Opcode) exceptions, etc. As I wrote in the previous part, we will see the implementation of the remaining exceptions in this part. We will see the implementation of the following handlers:
|
||||
|
||||
* [Non-Maskable](https://en.wikipedia.org/wiki/Non-maskable_interrupt) interrupt;
|
||||
* [BOUND](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm) Range Exceeded Exception;
|
||||
@ -21,13 +21,13 @@ A [Non-Maskable](https://en.wikipedia.org/wiki/Non-maskable_interrupt) interrupt
|
||||
* External hardware asserts the non-maskable interrupt [pin](https://en.wikipedia.org/wiki/CPU_socket) on the CPU.
|
||||
* The processor receives a message on the system bus or the APIC serial bus with a delivery mode `NMI`.
|
||||
|
||||
When the processor receives an `NMI` from one of these sources, it handles it immediately by calling the `NMI` handler pointed to by the interrupt vector with number `2` (see the table in the first [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html)). We already filled the [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) with the [vector number](https://en.wikipedia.org/wiki/Interrupt_vector_table), the address of the `nmi` interrupt handler and the `NMI_STACK` [Interrupt Stack Table entry](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/kernel-stacks):
|
||||
When the processor receives an `NMI` from one of these sources, it handles it immediately by calling the `NMI` handler pointed to by the interrupt vector with number `2` (see the table in the first [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html)). We already filled the [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) with the [vector number](https://en.wikipedia.org/wiki/Interrupt_vector_table), the address of the `nmi` interrupt handler and the `NMI_STACK` [Interrupt Stack Table entry](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/kernel-stacks):
|
||||
|
||||
```C
|
||||
set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
|
||||
```
|
||||
|
||||
in the `trap_init` function, which is defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/traps.c) source code file. In the previous [parts](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we saw that the entry points of all interrupt handlers are defined with the:
|
||||
in the `trap_init` function, which is defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/traps.c) source code file. In the previous [parts](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) we saw that the entry points of all interrupt handlers are defined with the:
|
||||
|
||||
```assembly
|
||||
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
|
||||
@ -75,7 +75,7 @@ The `__KERNEL_CS` macro defined in the [arch/x86/include/asm/segment.h](https://
|
||||
#define __KERNEL_CS (GDT_ENTRY_KERNEL_CS*8)
|
||||
```
|
||||
|
||||
You can read more about the `GDT` in the second [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) of the Linux kernel booting process chapter. If `cs` is not the kernel segment, it means that it is not a nested `NMI` and we jump to the `first_nmi` label. Let's consider this case. First of all, in the `first_nmi` label we put the address of the current stack pointer into `rdx` and push `1` onto the stack:
|
||||
You can read more about the `GDT` in the second [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) of the Linux kernel booting process chapter. If `cs` is not the kernel segment, it means that it is not a nested `NMI` and we jump to the `first_nmi` label. Let's consider this case. First of all, in the `first_nmi` label we put the address of the current stack pointer into `rdx` and push `1` onto the stack:
|
||||
|
||||
```assembly
|
||||
first_nmi:
|
||||
@ -169,7 +169,7 @@ pushq $-1
|
||||
ALLOC_PT_GPREGS_ON_STACK
|
||||
```
|
||||
|
||||
We already saw the implementation of the `ALLOC_PT_GPREGS_ON_STACK` macro in the third part of the interrupts [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html). This macro is defined in [arch/x86/entry/calling.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/calling.h) and allocates another `120` bytes on the stack for the general purpose registers, from `rdi` to `r15`:
|
||||
We already saw the implementation of the `ALLOC_PT_GPREGS_ON_STACK` macro in the third part of the interrupts [chapter](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html). This macro is defined in [arch/x86/entry/calling.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/calling.h) and allocates another `120` bytes on the stack for the general purpose registers, from `rdi` to `r15`:
|
||||
|
||||
```assembly
|
||||
.macro ALLOC_PT_GPREGS_ON_STACK addskip=0
|
||||
@ -260,7 +260,7 @@ Now let's look on the `do_nmi` exception handler. This function defined in the [
|
||||
* address of the `pt_regs`;
|
||||
* error code.
|
||||
|
||||
as all exception handlers do. `do_nmi` starts with a call to the `nmi_nesting_preprocess` function and ends with a call to `nmi_nesting_postprocess`. The `nmi_nesting_preprocess` function checks whether we are on the debug stack (the unlikely case) and, if we are, sets the `update_debug_stack` [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable to `1` and calls the `debug_stack_set_zero` function from [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/cpu/common.c). This function increases the `debug_stack_use_ctr` per-cpu variable and loads a new `Interrupt Descriptor Table`:
|
||||
as all exception handlers do. `do_nmi` starts with a call to the `nmi_nesting_preprocess` function and ends with a call to `nmi_nesting_postprocess`. The `nmi_nesting_preprocess` function checks whether we are on the debug stack (the unlikely case) and, if we are, sets the `update_debug_stack` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variable to `1` and calls the `debug_stack_set_zero` function from [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/cpu/common.c). This function increases the `debug_stack_use_ctr` per-cpu variable and loads a new `Interrupt Descriptor Table`:
|
||||
|
||||
```C
|
||||
static inline void nmi_nesting_preprocess(struct pt_regs *regs)
|
||||
@ -325,7 +325,7 @@ exception_exit(prev_state);
|
||||
return;
|
||||
```
|
||||
|
||||
After we have got the state of the previous context, we add the exception to the `notify_die` chain and, if it returns `NOTIFY_STOP`, we return from the exception. You can read more about notify chains and the `context tracking` functions in the [previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-5.html). In the next step we enable interrupts, if they were disabled, with the `conditional_sti` function, which checks the `IF` flag and calls `local_irq_enable` depending on its value:
|
||||
After we have got the state of the previous context, we add the exception to the `notify_die` chain and, if it returns `NOTIFY_STOP`, we return from the exception. You can read more about notify chains and the `context tracking` functions in the [previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-5.html). In the next step we enable interrupts, if they were disabled, with the `conditional_sti` function, which checks the `IF` flag and calls `local_irq_enable` depending on its value:
|
||||
|
||||
```C
|
||||
conditional_sti(regs);
|
||||
@ -446,7 +446,7 @@ That's all.
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the sixth part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter. In this part we saw the implementation of some exception handlers, like the `non-maskable` interrupt, [SIMD](https://en.wikipedia.org/wiki/SIMD) and [x87 FPU](https://en.wikipedia.org/wiki/X87) floating point exceptions. We have finally finished with the `trap_init` function and will go ahead in the next part. Our next point is the external interrupts and the `early_irq_init` function from [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c).
|
||||
This is the end of the sixth part of the [Interrupts and Interrupt Handling](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) chapter. In this part we saw the implementation of some exception handlers, like the `non-maskable` interrupt, [SIMD](https://en.wikipedia.org/wiki/SIMD) and [x87 FPU](https://en.wikipedia.org/wiki/X87) floating point exceptions. We have finally finished with the `trap_init` function and will go ahead in the next part. Our next point is the external interrupts and the `early_irq_init` function from [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c).
|
||||
|
||||
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
||||
|
||||
@ -473,8 +473,8 @@ Links
|
||||
* [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table)
|
||||
* [stack frame](https://en.wikipedia.org/wiki/Call_stack)
|
||||
* [Model Specific register](https://en.wikipedia.org/wiki/Model-specific_register)
|
||||
* [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
|
||||
* [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html)
|
||||
* [RCU](https://en.wikipedia.org/wiki/Read-copy-update)
|
||||
* [MPX](https://en.wikipedia.org/wiki/Intel_MPX)
|
||||
* [x87 FPU](https://en.wikipedia.org/wiki/X87)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-5.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-5.html)
|
@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 7.
|
||||
Introduction to external interrupts
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the seventh part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html). In the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-6.html) we finished with the exceptions which are generated by the processor. In this part we will continue to dive into interrupt handling and will start with external hardware interrupt handling. As you may remember, in the previous part we finished with the `trap_init` function from [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/traps.c), and the next step is the call of the `early_irq_init` function from [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c).
|
||||
This is the seventh part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html). In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-6.html) we finished with the exceptions which are generated by the processor. In this part we will continue to dive into interrupt handling and will start with external hardware interrupt handling. As you may remember, in the previous part we finished with the `trap_init` function from [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/traps.c), and the next step is the call of the `early_irq_init` function from [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c).
|
||||
|
||||
Interrupts are signals that are sent across an [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) or `Interrupt Request Line` by hardware or software. External hardware interrupts allow devices like a keyboard, mouse, etc. to indicate that they need the attention of the processor. Once the processor receives the `Interrupt Request`, it temporarily stops execution of the running program and invokes a special routine which depends on the interrupt. We already know that this routine is called an interrupt handler (or, as we will call it from this part on, `ISR` or `Interrupt Service Routine`). The `ISR` or `Interrupt Handler Routine` can be found in the Interrupt Vector table, which is located at a fixed address in memory. After the interrupt is handled, the processor resumes the interrupted process. At boot/initialization time, the Linux kernel identifies all devices in the machine, and the appropriate interrupt handlers are loaded into the interrupt table. As we saw in the previous parts, most exceptions are handled simply by sending a [Unix signal](https://en.wikipedia.org/wiki/Unix_signal) to the interrupted process. That's why the kernel can handle an exception quickly. Unfortunately, we cannot use this approach for external hardware interrupts, because they often arrive after (and sometimes long after) the process to which they are related has been suspended. So it would make no sense to send a Unix signal to the current process. External interrupt handling depends on the type of the interrupt:
|
||||
|
||||
@ -95,7 +95,7 @@ More about this will be in the another chapter about the `NUMA`. The next step a
|
||||
init_irq_default_affinity();
|
||||
```
|
||||
|
||||
function. The `init_irq_default_affinity` function is defined in the same source code file and, depending on the `CONFIG_SMP` kernel configuration option, allocates a given [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) structure (in our case it is `irq_default_affinity`):
|
||||
function. The `init_irq_default_affinity` function is defined in the same source code file and, depending on the `CONFIG_SMP` kernel configuration option, allocates a given [cpumask](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) structure (in our case it is `irq_default_affinity`):
|
||||
|
||||
```C
|
||||
#if defined(CONFIG_SMP)
|
||||
@ -207,7 +207,7 @@ for (i = 0; i < count; i++) {
|
||||
|
||||
We go through all the interrupt descriptors and do the following things:
|
||||
|
||||
First of all we allocate a [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable for the `irq` kernel statistics with the `alloc_percpu` macro. This macro allocates one instance of an object of the given type for every processor on the system. You can access kernel statistics from userspace via `/proc/stat`:
|
||||
First of all we allocate a [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variable for the `irq` kernel statistics with the `alloc_percpu` macro. This macro allocates one instance of an object of the given type for every processor on the system. You can access kernel statistics from userspace via `/proc/stat`:
|
||||
|
||||
```
|
||||
~$ cat /proc/stat
|
||||
@ -221,7 +221,7 @@ cpu3 26648 8 6931 678891 414 0 244 0 0 0
|
||||
...
|
||||
```
|
||||
|
||||
where the sixth column is the time spent servicing interrupts. After this we allocate a [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) for the affinity of the given irq descriptor and initialize the [spinlock](https://en.wikipedia.org/wiki/Spinlock) for the given interrupt descriptor. From now on, the lock will be acquired with a call to `raw_spin_lock` before a [critical section](https://en.wikipedia.org/wiki/Critical_section) and released with a call to `raw_spin_unlock` after it. In the next step we call the `lockdep_set_class` macro, which sets the [Lock validator](https://lwn.net/Articles/185666/) `irq_desc_lock_class` class for the lock of the given interrupt descriptor. More about `lockdep`, `spinlock` and other synchronization primitives will be described in a separate chapter.
|
||||
where the sixth column is the time spent servicing interrupts. After this we allocate a [cpumask](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) for the affinity of the given irq descriptor and initialize the [spinlock](https://en.wikipedia.org/wiki/Spinlock) for the given interrupt descriptor. From now on, the lock will be acquired with a call to `raw_spin_lock` before a [critical section](https://en.wikipedia.org/wiki/Critical_section) and released with a call to `raw_spin_unlock` after it. In the next step we call the `lockdep_set_class` macro, which sets the [Lock validator](https://lwn.net/Articles/185666/) `irq_desc_lock_class` class for the lock of the given interrupt descriptor. More about `lockdep`, `spinlock` and other synchronization primitives will be described in a separate chapter.
|
||||
|
||||
In the end of the loop we call the `desc_set_defaults` function from the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/irq/irqdesc.c). This function takes four parameters:
|
||||
|
||||
@ -275,7 +275,7 @@ desc->owner = owner;
|
||||
...
|
||||
```
|
||||
|
||||
After this we go through all [possible](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) processors with the [for_each_possible_cpu](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/cpumask.h#L714) helper and set `kstat_irqs` to zero for the given interrupt descriptor:
|
||||
After this we go through all [possible](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) processors with the [for_each_possible_cpu](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/cpumask.h#L714) helper and set `kstat_irqs` to zero for the given interrupt descriptor:
|
||||
|
||||
```C
|
||||
for_each_possible_cpu(cpu)
|
||||
@ -367,7 +367,7 @@ if (nr_irqs > (NR_VECTORS * nr_cpu_ids))
|
||||
nr = (gsi_top + nr_legacy_irqs()) + 8 * nr_cpu_ids;
|
||||
```
|
||||
|
||||
Take a look at the `gsi_top` variable. Each `APIC` is identified by its own `ID` and by the offset where its `IRQ`s start. This offset is called the `GSI` base or `Global System Interrupt` base, and `gsi_top` represents it. We get the `Global System Interrupt` base from the [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification) (you may remember that we parsed this table in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the Linux Kernel initialization process chapter).
|
||||
Take a look at the `gsi_top` variable. Each `APIC` is identified by its own `ID` and by the offset where its `IRQ`s start. This offset is called the `GSI` base or `Global System Interrupt` base, and `gsi_top` represents it. We get the `Global System Interrupt` base from the [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification) (you may remember that we parsed this table in the sixth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the Linux Kernel initialization process chapter).
|
||||
|
||||
After this we update `nr` depending on the value of `gsi_top`:
|
||||
|
||||
@ -413,7 +413,7 @@ if (WARN_ON(initcnt > IRQ_BITMAP_BITS))
|
||||
initcnt = IRQ_BITMAP_BITS;
|
||||
```
|
||||
|
||||
where `IRQ_BITMAP_BITS` is equal to `NR_IRQS` if `CONFIG_SPARSE_IRQ` is not set and to `NR_IRQS + 8196` otherwise. In the next step we go over all the interrupt descriptors that need to be allocated and, in the loop, allocate space for each descriptor and insert it into the `irq_desc_tree` [radix tree](http://0xax.gitbooks.io/linux-insides/content/DataStructures/radix-tree.html):
|
||||
where `IRQ_BITMAP_BITS` is equal to `NR_IRQS` if `CONFIG_SPARSE_IRQ` is not set and to `NR_IRQS + 8196` otherwise. In the next step we go over all the interrupt descriptors that need to be allocated and, in the loop, allocate space for each descriptor and insert it into the `irq_desc_tree` [radix tree](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-2.html):
|
||||
|
||||
```C
|
||||
for (i = 0; i < initcnt; i++) {
|
||||
@ -434,7 +434,7 @@ That's all.
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the seventh part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter, in which we started to dive into external hardware interrupts. We saw the early initialization of the `irq_desc` structure, which describes an external interrupt and contains information about it such as the list of irq actions, information about the interrupt handler, the interrupt's owner, the count of unhandled interrupts, etc. In the next part we will continue to research external interrupts.
|
||||
This is the end of the seventh part of the [Interrupts and Interrupt Handling](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) chapter, in which we started to dive into external hardware interrupts. We saw the early initialization of the `irq_desc` structure, which describes an external interrupt and contains information about it such as the list of irq actions, information about the interrupt handler, the interrupt's owner, the count of unhandled interrupts, etc. In the next part we will continue to research external interrupts.
|
||||
|
||||
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
||||
|
||||
@ -446,8 +446,8 @@ Links
|
||||
* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
|
||||
* [numa](https://en.wikipedia.org/wiki/Non-uniform_memory_access)
|
||||
* [Enum type](https://en.wikipedia.org/wiki/Enumerated_type)
|
||||
* [cpumask](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
|
||||
* [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
|
||||
* [cpumask](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html)
|
||||
* [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html)
|
||||
* [spinlock](https://en.wikipedia.org/wiki/Spinlock)
|
||||
* [critical section](https://en.wikipedia.org/wiki/Critical_section)
|
||||
* [Lock validator](https://lwn.net/Articles/185666/)
|
||||
@ -457,5 +457,5 @@ Links
|
||||
* [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259)
|
||||
* [PIC](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller)
|
||||
* [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification)
|
||||
* [radix tree](http://0xax.gitbooks.io/linux-insides/content/DataStructures/radix-tree.html)
|
||||
* [radix tree](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-2.html)
|
||||
* [dmesg](https://en.wikipedia.org/wiki/Dmesg)
|
@ -4,9 +4,9 @@ Interrupts and Interrupt Handling. Part 8.
|
||||
Non-early initialization of the IRQs
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the eighth part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html). In the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-7.html) we started to dive into external hardware [interrupts](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29). We looked at the implementation of the `early_irq_init` function from the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/irq/irqdesc.c) source code file and saw the initialization of the `irq_desc` structure in this function. Recall that the `irq_desc` structure (defined in [include/linux/irqdesc.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/irqdesc.h#L46)) is the foundation of the interrupt management code in the Linux kernel and represents an interrupt descriptor. In this part we will continue to dive into the initialization stuff which is related to external hardware interrupts.
|
||||
This is the eighth part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html). In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-7.html) we started to dive into external hardware [interrupts](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29). We looked at the implementation of the `early_irq_init` function from the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/irq/irqdesc.c) source code file and saw the initialization of the `irq_desc` structure in this function. Recall that the `irq_desc` structure (defined in [include/linux/irqdesc.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/irqdesc.h#L46)) is the foundation of the interrupt management code in the Linux kernel and represents an interrupt descriptor. In this part we will continue to dive into the initialization stuff which is related to external hardware interrupts.
|
||||
|
||||
Right after the call of the `early_irq_init` function in [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) we can see the call of the `init_IRQ` function. This function is architecture-specific and is defined in [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/irqinit.c). The `init_IRQ` function initializes the `vector_irq` [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable, which is defined in the same [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/irqinit.c) source code file:
|
||||
Right after the call of the `early_irq_init` function in [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) we can see the call of the `init_IRQ` function. This function is architecture-specific and is defined in [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/irqinit.c). The `init_IRQ` function initializes the `vector_irq` [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variable, which is defined in the same [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/irqinit.c) source code file:
|
||||
|
||||
```C
|
||||
...
|
||||
@ -22,13 +22,13 @@ and represents `percpu` array of the interrupt vector numbers. The `vector_irq_t
|
||||
typedef int vector_irq_t[NR_VECTORS];
|
||||
```
|
||||
|
||||
where `NR_VECTORS` is the count of vector numbers and, as you may remember from the first [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) of this chapter, it is `256` for [x86_64](https://en.wikipedia.org/wiki/X86-64):
|
||||
where `NR_VECTORS` is the count of vector numbers and, as you may remember from the first [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html) of this chapter, it is `256` for [x86_64](https://en.wikipedia.org/wiki/X86-64):
|
||||
|
||||
```C
|
||||
#define NR_VECTORS 256
|
||||
```
|
||||
|
||||
So, in the start of the `init_IRQ` function we fill the `vector_irq` [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) array with the vector number of the `legacy` interrupts:
|
||||
So, in the start of the `init_IRQ` function we fill the `vector_irq` [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) array with the vector number of the `legacy` interrupts:
|
||||
|
||||
```C
|
||||
void __init init_IRQ(void)
|
||||
@ -105,7 +105,7 @@ In the loop we are accessing the `vecto_irq` per-cpu array with the `per_cpu` ma
|
||||
#define IRQ0_VECTOR ((FIRST_EXTERNAL_VECTOR + 16) & ~15)
|
||||
```
|
||||
|
||||
Why is `0x30` here? You may remember from the first [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) of this chapter that the first 32 vector numbers, from `0` to `31`, are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. Vector numbers from `0x30` to `0x3f` are reserved for the [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture). So, it means that we fill `vector_irq` starting from `IRQ0_VECTOR`, which is equal to `0x30` (`48`), up to but not including `IRQ0_VECTOR + 16` (`0x40`).
|
||||
Why is `0x30` here? You may remember from the first [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html) of this chapter that the first 32 vector numbers, from `0` to `31`, are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. Vector numbers from `0x30` to `0x3f` are reserved for the [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture). So, it means that we fill `vector_irq` starting from `IRQ0_VECTOR`, which is equal to `0x30` (`48`), up to but not including `IRQ0_VECTOR + 16` (`0x40`).
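
The arithmetic behind `IRQ0_VECTOR` can be checked with a small userspace sketch. It assumes that `FIRST_EXTERNAL_VECTOR` is `0x20` (the first vector after the 32 reserved ones) and that there are 16 legacy interrupts; `NR_LEGACY_IRQS` and `legacy_irq_to_vector` are hypothetical names used only for this illustration, not kernel definitions:

```C
/* Assumption: the first vector usable for external interrupts is 0x20 (32). */
#define FIRST_EXTERNAL_VECTOR 0x20

/* Same expression as in the text: skip 16 vectors and align down to 16. */
#define IRQ0_VECTOR ((FIRST_EXTERNAL_VECTOR + 16) & ~15)

/* Assumption: 16 legacy (ISA) interrupt lines. */
#define NR_LEGACY_IRQS 16

/* Hypothetical helper: vector number used for legacy IRQ `irq` (0..15). */
static inline int legacy_irq_to_vector(int irq)
{
	return IRQ0_VECTOR + irq;
}
```

With these assumed values `IRQ0_VECTOR` evaluates to `0x30`, so legacy IRQs `0`..`15` land on vectors `0x30`..`0x3f`.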
|
||||
|
||||
In the end of the `init_IRQ` function we can see the call of the following function:
|
||||
|
||||
@ -113,7 +113,7 @@ In the end of the `init_IRQ` function we can see the call of the following funct
|
||||
x86_init.irqs.intr_init();
|
||||
```
|
||||
|
||||
from the [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/x86_init.c) source code file. If you have read the [chapter](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) about the Linux kernel initialization process, you may remember the `x86_init` structure. This structure contains a couple of fields which are pointers to functions related to the platform setup (`x86_64` in our case), for example `resources`, related to the memory resources, `mpparse`, related to the parsing of the [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification), etc. As we can see, `x86_init` also contains the `irqs` field, which contains the following three fields:
|
||||
from the [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/x86_init.c) source code file. If you have read the [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) about the Linux kernel initialization process, you may remember the `x86_init` structure. This structure contains a couple of fields which are pointers to functions related to the platform setup (`x86_64` in our case), for example `resources`, related to the memory resources, `mpparse`, related to the parsing of the [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification), etc. As we can see, `x86_init` also contains the `irqs` field, which contains the following three fields:
|
||||
|
||||
```C
|
||||
struct x86_init_ops x86_init __initdata
|
||||
@ -179,7 +179,7 @@ After this depends on the `CONFIG_X86_64` and `CONFIG_X86_LOCAL_APIC` kernel con
|
||||
#endif
|
||||
```
|
||||
|
||||
This function initializes the [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) of the `bootstrap processor` (the processor which starts first). It starts by checking whether an [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) config was found (read more about it in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the Linux kernel initialization process chapter) and whether the processor has an `APIC`:
|
||||
This function initializes the [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) of the `bootstrap processor` (the processor which starts first). It starts by checking whether an [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) config was found (read more about it in the sixth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the Linux kernel initialization process chapter) and whether the processor has an `APIC`:
|
||||
|
||||
```C
|
||||
if (smp_found_config || !cpu_has_apic)
|
||||
@ -241,7 +241,7 @@ do { \
|
||||
} while (0)
|
||||
```
|
||||
|
||||
As we can see, first of all it expands to a call of the `alloc_system_vector` function, which checks the given vector number in the `used_vectors` bitmap (read the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-7.html) about it) and, if it is not set there, sets it. After this we test whether `first_system_vector` is greater than the given interrupt vector number and, if it is, we assign the given vector number to it:
|
||||
As we can see, first of all it expands to a call of the `alloc_system_vector` function, which checks the given vector number in the `used_vectors` bitmap (read the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-7.html) about it) and, if it is not set there, sets it. After this we test whether `first_system_vector` is greater than the given interrupt vector number and, if it is, we assign the given vector number to it:
|
||||
|
||||
```C
|
||||
if (!test_bit(vector, used_vectors)) {
|
||||
@ -399,7 +399,7 @@ for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
|
||||
set_bit(i, used_vectors);
|
||||
```
|
||||
|
||||
You can remember how we did it in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-6.html) of this chapter.
|
||||
You can remember how we did it in the sixth [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-6.html) of this chapter.
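
The loop above marks the first `FIRST_EXTERNAL_VECTOR` entries of the `used_vectors` bitmap as taken. A minimal userspace sketch of the same idea follows. It assumes that `FIRST_EXTERNAL_VECTOR` is `0x20`, and `sketch_set_bit`, `sketch_test_bit` and `reserve_exception_vectors` are hypothetical stand-ins written for this illustration; unlike the kernel's `set_bit`, they are not atomic:

```C
#include <limits.h>
#include <stdbool.h>

#define NR_VECTORS 256
/* Assumption: vectors 0..31 are reserved for exceptions. */
#define FIRST_EXTERNAL_VECTOR 0x20
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* One bit per interrupt vector; a set bit means "vector is in use". */
static unsigned long used_vectors[(NR_VECTORS + BITS_PER_WORD - 1) / BITS_PER_WORD];

/* Hypothetical, non-atomic counterpart of the kernel's set_bit(). */
static void sketch_set_bit(unsigned int nr, unsigned long *map)
{
	map[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

/* Hypothetical counterpart of the kernel's test_bit(). */
static bool sketch_test_bit(unsigned int nr, const unsigned long *map)
{
	return (map[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}

/* Mark every vector below FIRST_EXTERNAL_VECTOR as used. */
static void reserve_exception_vectors(void)
{
	for (unsigned int i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
		sketch_set_bit(i, used_vectors);
}
```

After `reserve_exception_vectors()` runs, vectors `0` to `31` test as used while vector `32` is still free for an external interrupt.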
|
||||
|
||||
In the end of the `native_init_IRQ` function we can see the following check:
|
||||
|
||||
@ -509,7 +509,7 @@ That's all.
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the eighth part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter, in which we continued to dive into external hardware interrupts. In the previous part we started with this topic and saw the early initialization of the `IRQs`. In this part we saw the non-early interrupt initialization in the `init_IRQ` function. We saw the initialization of the `vector_irq` per-cpu array, which stores the vector numbers of the interrupts and will be used during interrupt handling, and the initialization of other stuff which is related to external hardware interrupts.
|
||||
This is the end of the eighth part of the [Interrupts and Interrupt Handling](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) chapter, in which we continued to dive into external hardware interrupts. In the previous part we started with this topic and saw the early initialization of the `IRQs`. In this part we saw the non-early interrupt initialization in the `init_IRQ` function. We saw the initialization of the `vector_irq` per-cpu array, which stores the vector numbers of the interrupts and will be used during interrupt handling, and the initialization of other stuff which is related to external hardware interrupts.
|
||||
|
||||
In the next part we will continue to learn interrupts handling related stuff and will see initialization of the `softirqs`.
|
||||
|
||||
@ -521,7 +521,7 @@ Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
|
||||
* [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
|
||||
* [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html)
|
||||
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
|
||||
* [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259)
|
||||
* [Programmable Interrupt Controller](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller)
|
||||
@ -539,4 +539,4 @@ Links
|
||||
* [Open Firmware](https://en.wikipedia.org/wiki/Open_Firmware)
|
||||
* [devicetree](https://en.wikipedia.org/wiki/Device_tree)
|
||||
* [RTC](https://en.wikipedia.org/wiki/Real-time_clock)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-7.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-7.html)
|
@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 9.
|
||||
Introduction to deferred interrupts (Softirq, Tasklets and Workqueues)
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the ninth part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) and in the [previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-8.html) we saw the implementation of the `init_IRQ` function defined in the [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/irqinit.c) source code file. So, in this part we will continue to dive into the initialization stuff which is related to the external hardware interrupts.
|
||||
This is the ninth part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) and in the [previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-8.html) we saw the implementation of the `init_IRQ` function defined in the [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/irqinit.c) source code file. So, in this part we will continue to dive into the initialization stuff which is related to the external hardware interrupts.
|
||||
|
||||
Interrupts may have different important characteristics and there are two among them:
|
||||
|
||||
@ -145,7 +145,7 @@ The `raise_softirq_irqoff` function marks the softirq as deffered by setting the
|
||||
__raise_softirq_irqoff(nr);
|
||||
```
|
||||
|
||||
macro. After this, it checks the result of `in_interrupt`, which returns the `irq_count` value. We already saw `irq_count` in the first [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) of this chapter; it is used to check whether a CPU is already on an interrupt stack. If we are in interrupt context, we just exit from `raise_softirq_irqoff`, restore the `IF` flag and enable interrupts on the local processor; otherwise we call `wakeup_softirqd`:
|
||||
macro. After this, it checks the result of `in_interrupt`, which returns the `irq_count` value. We already saw `irq_count` in the first [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html) of this chapter; it is used to check whether a CPU is already on an interrupt stack. If we are in interrupt context, we just exit from `raise_softirq_irqoff`, restore the `IF` flag and enable interrupts on the local processor; otherwise we call `wakeup_softirqd`:
|
||||
|
||||
```C
|
||||
if (!in_interrupt())
|
||||
@ -227,7 +227,7 @@ void __init softirq_init(void)
|
||||
}
|
||||
```
|
||||
|
||||
We can see the definition of the integer `cpu` variable at the beginning of the `softirq_init` function. Next we use it as a parameter for the `for_each_possible_cpu` macro, which goes through all possible processors in the system. If `possible processor` is new terminology for you, you can read more about it in the [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) chapter. In short, `possible cpus` is the set of processors that can be plugged in at any time during the life of that system boot. All `possible processors` are stored in the `cpu_possible_bits` bitmap; you can find its definition in [kernel/cpu.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/cpu.c):
|
||||
We can see the definition of the integer `cpu` variable at the beginning of the `softirq_init` function. Next we use it as a parameter for the `for_each_possible_cpu` macro, which goes through all possible processors in the system. If `possible processor` is new terminology for you, you can read more about it in the [CPU masks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) chapter. In short, `possible cpus` is the set of processors that can be plugged in at any time during the life of that system boot. All `possible processors` are stored in the `cpu_possible_bits` bitmap; you can find its definition in [kernel/cpu.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/cpu.c):
|
||||
|
||||
```C
|
||||
static DECLARE_BITMAP(cpu_possible_bits, CONFIG_NR_CPUS) __read_mostly;
|
||||
@ -237,7 +237,7 @@ static DECLARE_BITMAP(cpu_possible_bits, CONFIG_NR_CPUS) __read_mostly;
|
||||
const struct cpumask *const cpu_possible_mask = to_cpumask(cpu_possible_bits);
|
||||
```
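
As a rough user-space illustration of the idea (not the kernel's implementation), the sketch below keeps a "possible CPUs" bitmap in an array of `unsigned long` words and visits every set bit, which is roughly what a `for_each_possible_cpu`-style loop does over `cpu_possible_mask`. The names `NR_CPUS_DEMO`, `demo_possible_bits`, `demo_set_possible`, `demo_is_possible` and `demo_count_possible` are hypothetical and exist only for this example.

```C
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for CONFIG_NR_CPUS. */
#define NR_CPUS_DEMO  128
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS  ((NR_CPUS_DEMO + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* One bit per CPU id; a set bit means "this CPU is possible". */
static unsigned long demo_possible_bits[BITMAP_WORDS];

static void demo_set_possible(unsigned int cpu)
{
	assert(cpu < NR_CPUS_DEMO);
	demo_possible_bits[cpu / BITS_PER_WORD] |= 1UL << (cpu % BITS_PER_WORD);
}

static int demo_is_possible(unsigned int cpu)
{
	assert(cpu < NR_CPUS_DEMO);
	return (int)((demo_possible_bits[cpu / BITS_PER_WORD] >>
		      (cpu % BITS_PER_WORD)) & 1UL);
}

/* Visit every possible CPU, as a for_each_possible_cpu-style loop would. */
static unsigned int demo_count_possible(void)
{
	unsigned int cpu, count = 0;

	for (cpu = 0; cpu < NR_CPUS_DEMO; cpu++)
		if (demo_is_possible(cpu))
			count++;
	return count;
}
```

The per-cpu initialization in `softirq_init` follows the same pattern: the loop body runs once for each CPU whose bit is set, whether or not that CPU is currently online.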
|
||||
|
||||
Ok, we defined the integer `cpu` variable and go through all possible processors with the `for_each_possible_cpu` macro, initializing the following two [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variables:
|
||||
Ok, we defined the integer `cpu` variable and go through all possible processors with the `for_each_possible_cpu` macro, initializing the following two [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variables:
|
||||
|
||||
* `tasklet_vec`;
|
||||
* `tasklet_hi_vec`;
|
||||
@ -364,7 +364,7 @@ static void tasklet_action(struct softirq_action *a)
|
||||
}
|
||||
```
|
||||
|
||||
At the beginning of the `tasklet_action` function, we disable interrupts for the local processor with the help of the `local_irq_disable` macro (you can read about this macro in the second [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-2.html) of this chapter). In the next step, we take the head of the list that contains tasklets with normal priority and set this per-cpu list to `NULL`, because all of its tasklets will now be executed. After this we enable interrupts for the local processor and go through the list of tasklets in the loop. In every iteration of the loop we call the `tasklet_trylock` function for the given tasklet, which sets the state of the given tasklet to `TASKLET_STATE_RUN`:
|
||||
At the beginning of the `tasklet_action` function, we disable interrupts for the local processor with the help of the `local_irq_disable` macro (you can read about this macro in the second [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-2.html) of this chapter). In the next step, we take the head of the list that contains tasklets with normal priority and set this per-cpu list to `NULL`, because all of its tasklets will now be executed. After this we enable interrupts for the local processor and go through the list of tasklets in the loop. In every iteration of the loop we call the `tasklet_trylock` function for the given tasklet, which sets the state of the given tasklet to `TASKLET_STATE_RUN`:
|
||||
|
||||
```C
|
||||
static inline int tasklet_trylock(struct tasklet_struct *t)
|
||||
@ -477,7 +477,7 @@ bool queue_work_on(int cpu, struct workqueue_struct *wq,
|
||||
}
|
||||
```
|
||||
|
||||
The `__queue_work` function gets the `work pool`. Yes, the `work pool`, not the `workqueue`. Actually, `works` are not placed in the `workqueue` itself, but in a `work pool` that is represented by the `worker_pool` structure in the Linux kernel. As you can see above, the `workqueue_struct` structure has the `pwqs` field, which is a list of `pool_workqueue` structures. When we create a `workqueue`, a `pool_workqueue` is allocated for each processor. Each `pool_workqueue` is associated with a `worker_pool`, which is allocated on the same processor and corresponds to the type of priority queue. Through them the `workqueue` interacts with the `worker_pool`. So in the `__queue_work` function we set the cpu to the current processor with `raw_smp_processor_id` (you can find information about this macro in the fourth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process chapter), get the `pool_workqueue` for the given `workqueue_struct` and insert the given `work` into the given `workqueue`:
|
||||
The `__queue_work` function gets the `work pool`. Yes, the `work pool`, not the `workqueue`. Actually, `works` are not placed in the `workqueue` itself, but in a `work pool` that is represented by the `worker_pool` structure in the Linux kernel. As you can see above, the `workqueue_struct` structure has the `pwqs` field, which is a list of `pool_workqueue` structures. When we create a `workqueue`, a `pool_workqueue` is allocated for each processor. Each `pool_workqueue` is associated with a `worker_pool`, which is allocated on the same processor and corresponds to the type of priority queue. Through them the `workqueue` interacts with the `worker_pool`. So in the `__queue_work` function we set the cpu to the current processor with `raw_smp_processor_id` (you can find information about this macro in the fourth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process chapter), get the `pool_workqueue` for the given `workqueue_struct` and insert the given `work` into the given `workqueue`:
|
||||
|
||||
```C
|
||||
static void __queue_work(int cpu, struct workqueue_struct *wq,
|
||||
@ -506,7 +506,7 @@ That's all.
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
It is the end of the ninth part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter and we continued to dive into external hardware interrupts in this part. In the previous part we saw initialization of the `IRQs` and main `irq_desc` structure. In this part we saw three concepts: the `softirq`, `tasklet` and `workqueue` that are used for the deferred functions.
|
||||
It is the end of the ninth part of the [Interrupts and Interrupt Handling](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) chapter and we continued to dive into external hardware interrupts in this part. In the previous part we saw initialization of the `IRQs` and main `irq_desc` structure. In this part we saw three concepts: the `softirq`, `tasklet` and `workqueue` that are used for the deferred functions.
|
||||
|
||||
The next part will be the last part of the `Interrupts and Interrupt Handling` chapter; we will look at a real hardware driver and try to learn how it works with the interrupts subsystem.
|
||||
|
||||
@ -520,7 +520,7 @@ Links
|
||||
* [initcall](http://www.compsoc.man.ac.uk/~moz/kernelnewbies/documents/initcall/index.html)
|
||||
* [IF](https://en.wikipedia.org/wiki/Interrupt_flag)
|
||||
* [eflags](https://en.wikipedia.org/wiki/FLAGS_register)
|
||||
* [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
|
||||
* [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
|
||||
* [CPU masks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html)
|
||||
* [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html)
|
||||
* [Workqueue](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/workqueue.txt)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-8.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-8.html)
|
@ -4,13 +4,13 @@ Linux kernel memory management Part 1.
|
||||
Introduction
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Memory management is one of the most complex (and I think that it is the most complex) parts of the operating system kernel. In the [last preparations before the kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) part we stopped right before the call of the `start_kernel` function. This function initializes all the kernel features (including architecture-dependent features) before the kernel runs the first `init` process. You may remember that we built early page tables, identity page tables and fixmap page tables at boot time. No complicated memory management is working yet. When the `start_kernel` function is called we will see the transition to more complex data structures and techniques for memory management. For a good understanding of the initialization process in the linux kernel we need to have a clear understanding of these techniques. This chapter will provide an overview of the different parts of the linux kernel memory management framework and its API, starting from the `memblock`.
|
||||
Memory management is one of the most complex (and I think that it is the most complex) parts of the operating system kernel. In the [last preparations before the kernel entry point](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) part we stopped right before the call of the `start_kernel` function. This function initializes all the kernel features (including architecture-dependent features) before the kernel runs the first `init` process. You may remember that we built early page tables, identity page tables and fixmap page tables at boot time. No complicated memory management is working yet. When the `start_kernel` function is called we will see the transition to more complex data structures and techniques for memory management. For a good understanding of the initialization process in the linux kernel we need to have a clear understanding of these techniques. This chapter will provide an overview of the different parts of the linux kernel memory management framework and its API, starting from the `memblock`.
|
||||
|
||||
Memblock
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Memblock is one of the methods of managing memory regions during the early bootstrap period while the usual kernel memory allocators are not up and
|
||||
running yet. Previously it was called `Logical Memory Block`, but with the [patch](https://lkml.org/lkml/2010/7/13/68) by Yinghai Lu, it was renamed to `memblock`. The Linux kernel for the `x86_64` architecture uses this method. We already met `memblock` in the [Last preparations before the kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) part. Now it's time to get better acquainted with it and see how it is implemented.
|
||||
running yet. Previously it was called `Logical Memory Block`, but with the [patch](https://lkml.org/lkml/2010/7/13/68) by Yinghai Lu, it was renamed to `memblock`. The Linux kernel for the `x86_64` architecture uses this method. We already met `memblock` in the [Last preparations before the kernel entry point](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) part. Now it's time to get better acquainted with it and see how it is implemented.
|
||||
|
||||
We will start to learn `memblock` from the data structures. Definitions of all logical-memory-block-related data structures can be found in the [include/linux/memblock.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/memblock.h) header file.
|
||||
|
||||
@ -163,7 +163,7 @@ This function takes a physical base address and the size of the memory region as
|
||||
memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0);
|
||||
```
|
||||
|
||||
function. We pass the memory block type - `memory`, the physical base address and the size of the memory region, the maximum number of nodes which is 1 if `CONFIG_NODES_SHIFT` is not set in the configuration file or `1 << CONFIG_NODES_SHIFT` if it is set, and the flags. The `memblock_add_range` function adds a new memory region to the memory block. It starts by checking the size of the given region and if it is zero it just returns. After this, `memblock_add_range` checks the existence of the memory regions in the `memblock` structure with the given `memblock_type`. If there are no memory regions, we just fill a new `memory_region` with the given values and return (we already saw the implementation of this in the [First touch of the linux kernel memory manager framework](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html)). If `memblock_type` is not empty, we start to add a new memory region to the `memblock` with the given `memblock_type`.
|
||||
function. We pass the memory block type - `memory`, the physical base address and the size of the memory region, the maximum number of nodes which is 1 if `CONFIG_NODES_SHIFT` is not set in the configuration file or `1 << CONFIG_NODES_SHIFT` if it is set, and the flags. The `memblock_add_range` function adds a new memory region to the memory block. It starts by checking the size of the given region and if it is zero it just returns. After this, `memblock_add_range` checks the existence of the memory regions in the `memblock` structure with the given `memblock_type`. If there are no memory regions, we just fill a new `memory_region` with the given values and return (we already saw the implementation of this in the [First touch of the linux kernel memory manager framework](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html)). If `memblock_type` is not empty, we start to add a new memory region to the `memblock` with the given `memblock_type`.
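
The control flow described above (return on zero size, fill the first region if the type is empty, otherwise add a region to a non-empty type) can be sketched in plain C. This is a simplified model under stated assumptions, not the kernel's `memblock_add_range`: the names `demo_region`, `demo_type`, `DEMO_MAX_REGIONS` and `demo_add_range` are hypothetical, the region array has a fixed size, an empty type is modelled as `cnt == 0`, and the new region is assumed not to overlap existing ones, so no splitting or merging is done.

```C
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_REGIONS 8

struct demo_region {
	uint64_t base;
	uint64_t size;
};

struct demo_type {
	unsigned int cnt;                       /* number of used regions */
	struct demo_region regions[DEMO_MAX_REGIONS];
};

/*
 * Simplified model of adding a region: returns 0 on success and -1 if
 * the array is full. Overlap handling and merging are left out.
 */
static int demo_add_range(struct demo_type *type, uint64_t base, uint64_t size)
{
	unsigned int idx;

	assert(type != NULL);

	if (size == 0)                          /* nothing to add */
		return 0;

	if (type->cnt == 0) {                   /* empty type: fill the first region */
		type->regions[0].base = base;
		type->regions[0].size = size;
		type->cnt = 1;
		return 0;
	}

	if (type->cnt == DEMO_MAX_REGIONS)
		return -1;

	/* Find the slot that keeps the regions sorted by base address. */
	for (idx = 0; idx < type->cnt; idx++)
		if (base < type->regions[idx].base)
			break;

	/* Shift the tail up by one slot and store the new region. */
	memmove(&type->regions[idx + 1], &type->regions[idx],
		(type->cnt - idx) * sizeof(type->regions[0]));
	type->regions[idx].base = base;
	type->regions[idx].size = size;
	type->cnt++;
	return 0;
}
```

Keeping the array sorted by base address is what makes the later steps of the real function (finding overlaps with neighbours and merging them) a linear walk over the regions.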
|
||||
|
||||
First of all we get the end of the memory region with the:
|
||||
|
||||
@ -420,4 +420,4 @@ Links
|
||||
* [e820](http://en.wikipedia.org/wiki/E820)
|
||||
* [numa](http://en.wikipedia.org/wiki/Non-uniform_memory_access)
|
||||
* [debugfs](http://en.wikipedia.org/wiki/Debugfs)
|
||||
* [First touch of the linux kernel memory manager framework](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html)
|
||||
* [First touch of the linux kernel memory manager framework](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html)
|
@ -4,7 +4,7 @@ Linux kernel memory management Part 2.
|
||||
Fix-Mapped Addresses and ioremap
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
`Fix-Mapped` addresses are a set of special compile-time addresses whose corresponding physical addresses do not have to be a linear address minus `__START_KERNEL_map`. Each fix-mapped address maps one page frame and the kernel uses them as pointers that never change their address. That is the main point of these addresses. As the comment says: `to have a constant address at compile time, but to set the physical address only in the boot process`. You can remember that in the earliest [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html), we already set the `level2_fixmap_pgt`:
|
||||
`Fix-Mapped` addresses are a set of special compile-time addresses whose corresponding physical addresses do not have to be a linear address minus `__START_KERNEL_map`. Each fix-mapped address maps one page frame and the kernel uses them as pointers that never change their address. That is the main point of these addresses. As the comment says: `to have a constant address at compile time, but to set the physical address only in the boot process`. You can remember that in the earliest [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html), we already set the `level2_fixmap_pgt`:
|
||||
|
||||
```assembly
|
||||
NEXT_PAGE(level2_fixmap_pgt)
|
||||
@ -96,7 +96,7 @@ As in previous example (in `__fix_to_virt` macro), we start from the top of the
|
||||
|
||||
That's all. For this moment we know a little about `fix-mapped` addresses, but this is enough to go next.
|
||||
|
||||
`Fix-mapped` addresses are used in different [places](http://lxr.free-electrons.com/ident?i=fix_to_virt) in the linux kernel. The `IDT` descriptor is stored there, the [Intel Trusted Execution Technology](http://en.wikipedia.org/wiki/Trusted_Execution_Technology) UUID is stored in the `fix-mapped` area starting from the `FIX_TBOOT_BASE` index, as are the [Xen](http://en.wikipedia.org/wiki/Xen) bootmap and many more... We already saw a little about `fix-mapped` addresses in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) about the linux kernel initialization. We use the `fix-mapped` area in the early `ioremap` initialization. Let's look at it more closely and try to understand what `ioremap` is, how it is implemented in the kernel and how it is related to the `fix-mapped` addresses.
|
||||
`Fix-mapped` addresses are used in different [places](http://lxr.free-electrons.com/ident?i=fix_to_virt) in the linux kernel. The `IDT` descriptor is stored there, the [Intel Trusted Execution Technology](http://en.wikipedia.org/wiki/Trusted_Execution_Technology) UUID is stored in the `fix-mapped` area starting from the `FIX_TBOOT_BASE` index, as are the [Xen](http://en.wikipedia.org/wiki/Xen) bootmap and many more... We already saw a little about `fix-mapped` addresses in the fifth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) about the linux kernel initialization. We use the `fix-mapped` area in the early `ioremap` initialization. Let's look at it more closely and try to understand what `ioremap` is, how it is implemented in the kernel and how it is related to the `fix-mapped` addresses.
|
||||
|
||||
ioremap
|
||||
--------------------------------------------------------------------------------
|
||||
@ -149,7 +149,7 @@ As we can see it takes three parameters:
|
||||
* `n` - length of region;
|
||||
* `name` - name of requester.
|
||||
|
||||
`request_region` allocates an `I/O` port region. Very often the `check_region` function is called before the `request_region` to check that the given address range is available and the `release_region` function to release the memory region. `request_region` returns a pointer to the `resource` structure. The `resource` structure represents an abstraction for a tree-like subset of system resources. We already saw the `resource` structure in the fifth part of the kernel [initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) process and it looks as follows:
|
||||
`request_region` allocates an `I/O` port region. Very often the `check_region` function is called before the `request_region` to check that the given address range is available and the `release_region` function to release the memory region. `request_region` returns a pointer to the `resource` structure. The `resource` structure represents an abstraction for a tree-like subset of system resources. We already saw the `resource` structure in the fifth part of the kernel [initialization](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) process and it looks as follows:
|
||||
|
||||
```C
|
||||
struct resource {
|
||||
@ -274,13 +274,13 @@ static inline const char *e820_type_to_string(int e820_type)
|
||||
|
||||
and we can see them in the `/proc/iomem` (read above).
|
||||
|
||||
Now let's try to understand how `ioremap` works. We already know a little about `ioremap`, we saw it in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) about linux kernel initialization. If you have read this part, you can remember the call of the `early_ioremap_init` function from the [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/mm/ioremap.c). Initialization of the `ioremap` is split into two parts: there is the early part which we can use before the normal `ioremap` is available and the normal `ioremap` which is available after `vmalloc` initialization and the call of `paging_init`. We do not know anything about `vmalloc` for now, so let's consider early initialization of the `ioremap`. First of all `early_ioremap_init` checks that `fixmap` is aligned on page middle directory boundary:
|
||||
Now let's try to understand how `ioremap` works. We already know a little about `ioremap`, we saw it in the fifth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) about linux kernel initialization. If you have read this part, you can remember the call of the `early_ioremap_init` function from the [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/mm/ioremap.c). Initialization of the `ioremap` is split into two parts: there is the early part which we can use before the normal `ioremap` is available and the normal `ioremap` which is available after `vmalloc` initialization and the call of `paging_init`. We do not know anything about `vmalloc` for now, so let's consider early initialization of the `ioremap`. First of all `early_ioremap_init` checks that `fixmap` is aligned on page middle directory boundary:
|
||||
|
||||
```C
|
||||
BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
|
||||
```
|
||||
|
||||
You can read more about `BUILD_BUG_ON` in the first part about [Linux Kernel initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html). The `BUILD_BUG_ON` macro raises a compilation error if the given expression is true. In the next step after this check, we can see the call of the `early_ioremap_setup` function from [mm/early_ioremap.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/mm/early_ioremap.c). This function performs the generic initialization of the `ioremap`. The `early_ioremap_setup` function fills the `slot_virt` array with the virtual addresses of the early fixmaps. All early fixmaps are after `__end_of_permanent_fixed_addresses` in memory. They start at `FIX_BTMAP_BEGIN` (top) and end with `FIX_BTMAP_END` (down). Actually there are `512` temporary boot-time mappings, used by early `ioremap`:
|
||||
You can read more about `BUILD_BUG_ON` in the first part about [Linux Kernel initialization](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html). The `BUILD_BUG_ON` macro raises a compilation error if the given expression is true. In the next step after this check, we can see the call of the `early_ioremap_setup` function from [mm/early_ioremap.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/mm/early_ioremap.c). This function performs the generic initialization of the `ioremap`. The `early_ioremap_setup` function fills the `slot_virt` array with the virtual addresses of the early fixmaps. All early fixmaps are after `__end_of_permanent_fixed_addresses` in memory. They start at `FIX_BTMAP_BEGIN` (top) and end with `FIX_BTMAP_END` (down). Actually there are `512` temporary boot-time mappings, used by early `ioremap`:
|
||||
|
||||
```
|
||||
#define NR_FIX_BTMAPS 64
|
||||
@ -335,7 +335,7 @@ pmd_populate_kernel(&init_mm, pmd, bm_pte);
|
||||
|
||||
`pmd_populate_kernel` takes three parameters:
|
||||
|
||||
* `init_mm` - memory descriptor of the `init` process (you can read about it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html));
|
||||
* `init_mm` - memory descriptor of the `init` process (you can read about it in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html));
|
||||
* `pmd` - page middle directory of the beginning of the `ioremap` fixmaps;
|
||||
* `bm_pte` - early `ioremap` page table entries array which defined as:
|
||||
|
||||
@ -535,5 +535,5 @@ Links
|
||||
* [e820](http://en.wikipedia.org/wiki/E820)
|
||||
* [Memory management unit](http://en.wikipedia.org/wiki/Memory_management_unit)
|
||||
* [TLB](http://en.wikipedia.org/wiki/Translation_lookaside_buffer)
|
||||
* [Paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)
|
||||
* [Linux kernel memory management Part 1.](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-1.html)
|
||||
* [Paging](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html)
|
||||
* [Linux kernel memory management Part 1.](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-1.html)
|
@ -4,7 +4,7 @@ Linux kernel memory management Part 3.
|
||||
Introduction to the kmemcheck in the Linux kernel
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/mm/) which describes [memory management](https://en.wikipedia.org/wiki/Memory_management) in the Linux kernel and in the previous [part](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) of this chapter we met two memory management related concepts:
|
||||
This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/MM/) which describes [memory management](https://en.wikipedia.org/wiki/Memory_management) in the Linux kernel and in the previous [part](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-2.html) of this chapter we met two memory management related concepts:
|
||||
|
||||
* `Fix-Mapped Addresses`;
|
||||
* `ioremap`.
|
||||
@ -61,7 +61,7 @@ $ sudo cat /proc/ioports
|
||||
...
|
||||
```
|
||||
|
||||
can show us lists of currently registered port regions used for input or output communication with a device. Memory-mapped I/O addresses are not used by the kernel directly. So, before the Linux kernel can use such memory, it must map it to the virtual memory space, which is the main purpose of the `ioremap` mechanism. Note that we saw only the early `ioremap` in the previous [part](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html). Soon we will look at the implementation of the non-early `ioremap` function. But before this we must learn other things, like the different types of memory allocators, because otherwise it will be very difficult to understand.
|
||||
can show us lists of currently registered port regions used for input or output communication with a device. Memory-mapped I/O addresses are not used by the kernel directly. So, before the Linux kernel can use such memory, it must map it to the virtual memory space, which is the main purpose of the `ioremap` mechanism. Note that we saw only the early `ioremap` in the previous [part](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-2.html). Soon we will look at the implementation of the non-early `ioremap` function. But before this we must learn other things, like the different types of memory allocators, because otherwise it will be very difficult to understand.
|
||||
|
||||
So, before we move on to the non-early [memory management](https://en.wikipedia.org/wiki/Memory_management) of the Linux kernel, we will see some mechanisms which provide special abilities for [debugging](https://en.wikipedia.org/wiki/Debugging), checking for [memory leaks](https://en.wikipedia.org/wiki/Memory_leak), memory control, etc. It will be easier to understand how memory management is arranged in the Linux kernel after learning all of these things.
|
||||
|
||||
@ -148,7 +148,7 @@ Ok, so we know that `kmemcheck` provides mechanism to check usage of `uninitiali
|
||||
struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL);
|
||||
```
|
||||
|
||||
or in other words somebody wants to access a [page](https://en.wikipedia.org/wiki/Page_%28computer_memory%29), a [page fault](https://en.wikipedia.org/wiki/Page_fault) exception is generated. This is achieved by the fact that the `kmemcheck` marks memory pages as `non-present` (more about this you can read in the special part which is devoted to [paging](https://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)). If a `page fault` exception is occurred, the exception handler knows about it and in a case when the `kmemcheck` is enabled it transfers control to it. After the `kmemcheck` will finish its checks, the page will be marked as `present` and the interrupted code will be able to continue execution. There is little subtlety in this chain. When the first instruction of interrupted code will be executed, the `kmemcheck` will mark the page as `non-present` again. In this way next access to memory will be caught again.
|
||||
or in other words somebody wants to access a [page](https://en.wikipedia.org/wiki/Page_%28computer_memory%29), a [page fault](https://en.wikipedia.org/wiki/Page_fault) exception is generated. This is achieved by the fact that `kmemcheck` marks memory pages as `non-present` (you can read more about this in the special part which is devoted to [Paging](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html)). If a `page fault` exception occurs, the exception handler knows about it and, when `kmemcheck` is enabled, transfers control to it. After `kmemcheck` finishes its checks, the page is marked as `present` and the interrupted code is able to continue execution. There is a little subtlety in this chain. After the first instruction of the interrupted code is executed, `kmemcheck` marks the page as `non-present` again. In this way the next access to memory will be caught again.
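The hide/fault/check/show/hide cycle described here can be modelled with a small userspace toy. This is only a sketch of the idea, not kernel code: the `TrackedPage` class and all of its method names are hypothetical, and a real implementation relies on page table bits and single-stepping rather than Python objects.

```python
class TrackedPage:
    """Toy model of one tracked page; every name here is hypothetical."""

    def __init__(self, size):
        self.present = False               # tracked pages start out non-present
        self.initialized = [False] * size  # shadow state: one flag per byte
        self.data = [0] * size
        self.faults = 0
        self.warnings = []                 # offsets that were read before being written

    def _page_fault(self, offset, is_write):
        # Stand-in for the fault handler: run the check, then mark the page
        # present so the interrupted access can proceed.
        self.faults += 1
        if not is_write and not self.initialized[offset]:
            self.warnings.append(offset)
        self.present = True

    def _after_instruction(self):
        # Once the interrupted access has completed, hide the page again so
        # that the next access faults as well.
        self.present = False

    def read(self, offset):
        if not self.present:
            self._page_fault(offset, is_write=False)
        value = self.data[offset]
        self._after_instruction()
        return value

    def write(self, offset, value):
        if not self.present:
            self._page_fault(offset, is_write=True)
        self.data[offset] = value
        self.initialized[offset] = True
        self._after_instruction()


page = TrackedPage(4)
page.read(0)       # read of an uninitialized byte: recorded as a warning
page.write(1, 7)   # a write initializes the byte
page.read(1)       # already initialized: no warning
print(page.faults, page.warnings)
```

Every access in the toy faults exactly once because the page is hidden again after each access, which mirrors the subtlety mentioned at the end of the paragraph.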
|
||||
|
||||
We just considered the `kmemcheck` mechanism from the theoretical side. Now let's consider how it is implemented in the Linux kernel.
|
||||
|
||||
@ -190,7 +190,7 @@ early_param("kmemcheck", param_kmemcheck);
|
||||
|
||||
As we already saw, the `kmemcheck` command line parameter may have one of the following values: `0` (disabled), `1` (enabled) or `2` (one-shot). The implementation of `param_kmemcheck` is pretty simple. We just convert the string value of the `kmemcheck` command line option to its integer representation and store it in the `kmemcheck_enabled` variable.
|
||||
|
||||
The second stage will be executed during initialization of the Linux kernel, rather during initialization of early [initcalls](https://0xax.gitbooks.io/linux-insides/content/Concepts/initcall.html). The second stage is represented by the `kmemcheck_init`:
|
||||
The second stage is executed during initialization of the Linux kernel, more precisely during initialization of early [initcalls](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-3.html). The second stage is represented by the `kmemcheck_init` function:
|
||||
|
||||
```C
|
||||
int __init kmemcheck_init(void)
|
||||
@ -296,7 +296,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
|
||||
}
|
||||
```
|
||||
|
||||
The `kmemcheck_active` gets `kmemcheck_context` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) structure and return the result of comparison of the `balance` field of this structure with zero:
|
||||
The `kmemcheck_active` function gets the `kmemcheck_context` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) structure and returns the result of comparing the `balance` field of this structure with zero:
|
||||
|
||||
```
|
||||
bool kmemcheck_active(struct pt_regs *regs)
|
||||
@ -337,7 +337,7 @@ Last two steps of the `kmemcheck_fault` function is to call the `kmemcheck_acces
|
||||
static struct kmemcheck_error error_fifo[CONFIG_KMEMCHECK_QUEUE_SIZE];
|
||||
```
|
||||
|
||||
The `kmemcheck` mechanism declares special [tasklet](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html):
|
||||
The `kmemcheck` mechanism declares a special [tasklet](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-9.html):
|
||||
|
||||
```C
|
||||
static DECLARE_TASKLET(kmemcheck_tasklet, &do_wakeup, 0);
|
||||
@ -422,13 +422,12 @@ Links
|
||||
* [memory leaks](https://en.wikipedia.org/wiki/Memory_leak)
|
||||
* [kmemcheck documentation](https://www.kernel.org/doc/Documentation/kmemcheck.txt)
|
||||
* [valgrind](https://en.wikipedia.org/wiki/Valgrind)
|
||||
* [paging](https://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)
|
||||
* [Paging](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html)
|
||||
* [page fault](https://en.wikipedia.org/wiki/Page_fault)
|
||||
* [initcalls](https://0xax.gitbooks.io/linux-insides/content/Concepts/initcall.html)
|
||||
* [initcalls](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-3.html)
|
||||
* [opcode](https://en.wikipedia.org/wiki/Opcode)
|
||||
* [translation lookaside buffer](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)
|
||||
* [per-cpu variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
|
||||
* [per-cpu variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html)
|
||||
* [flags register](https://en.wikipedia.org/wiki/FLAGS_register)
|
||||
* [tasklet](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html)
|
||||
* [Paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)
|
||||
* [tasklet](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-9.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-2.html)
|
@ -4,7 +4,7 @@ Linux kernel development
|
||||
Introduction
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
As you already may know, I've started a series of [blog posts](http://0xax.github.io/categories/assembly/) about assembler programming for `x86_64` architecture in the last year. I have never written a line of low-level code before this moment, except for a couple of toy `Hello World` examples in university. It was a long time ago and, as I already said, I didn't write low-level code at all. Some time ago I became interested in such things. I understood that I can write programs, but didn't actually understand how my program is arranged.
|
||||
As you already may know, I've started a series of [blog posts](https://0xax.github.io/categories/assembler/) about assembler programming for `x86_64` architecture in the last year. I have never written a line of low-level code before this moment, except for a couple of toy `Hello World` examples in university. It was a long time ago and, as I already said, I didn't write low-level code at all. Some time ago I became interested in such things. I understood that I can write programs, but didn't actually understand how my program is arranged.
|
||||
|
||||
After writing some assembler code I began to understand how my program looks after compilation, **approximately**. But anyway, I didn't understand many other things. For example: what occurs when the `syscall` instruction is executed in my assembler, what occurs when the `printf` function starts to work or how can my program talk with other computers via network. [Assembler](https://en.wikipedia.org/wiki/Assembly_language#Assembler) programming language didn't give me answers to my questions and I decided to go deeper in my research. I started to learn from the source code of the Linux kernel and tried to understand the things that I'm interested in. The source code of the Linux kernel didn't give me the answers to **all** of my questions, but now my knowledge about the Linux kernel and the processes around it is much better.
|
||||
|
||||
@ -467,7 +467,7 @@ Please note that English is not my first language, and I am really sorry for any
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [blog posts about assembly programming for x86_64](http://0xax.github.io/categories/assembly/)
|
||||
* [blog posts about assembly programming for x86_64](https://0xax.github.io/categories/assembler/)
|
||||
* [Assembler](https://en.wikipedia.org/wiki/Assembly_language#Assembler)
|
||||
* [distro](https://en.wikipedia.org/wiki/Linux_distribution)
|
||||
* [package manager](https://en.wikipedia.org/wiki/Package_manager)
|
@ -1,7 +1,7 @@
|
||||
Introduction
|
||||
---------------
|
||||
|
||||
During the writing of the [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book I have received many emails with questions related to the [linker](https://en.wikipedia.org/wiki/Linker_%28computing%29) script and linker-related subjects. So I've decided to write this to cover some aspects of the linker and the linking of object files.
|
||||
During the writing of the [linux-insides](https://0xax.gitbooks.io/linux-insides/content/) book I have received many emails with questions related to the [linker](https://en.wikipedia.org/wiki/Linker_%28computing%29) script and linker-related subjects. So I've decided to write this to cover some aspects of the linker and the linking of object files.
|
||||
|
||||
If we open the `Linker` page on Wikipedia, we will see following definition:
|
||||
|
||||
@ -569,7 +569,7 @@ Disassembly of section .data:
|
||||
...
|
||||
```
|
||||
|
||||
Apart from the commands we have already seen, there are a few others. The first is the `ASSERT(exp, message)` that ensures that given expression is not zero. If it is zero, then exit the linker with an error code and print the given error message. If you've read about Linux kernel booting process in the [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book, you may know that the setup header of the Linux kernel has offset `0x1f1`. In the linker script of the Linux kernel we can find a check for this:
|
||||
Apart from the commands we have already seen, there are a few others. The first is `ASSERT(exp, message)`, which ensures that the given expression is not zero. If it is zero, the linker exits with an error code and prints the given error message. If you've read about the Linux kernel booting process in the [linux-insides](https://0xax.gitbooks.io/linux-insides/content/) book, you may know that the setup header of the Linux kernel has offset `0x1f1`. In the linker script of the Linux kernel we can find a check for this:
|
||||
|
||||
```
|
||||
. = ASSERT(hdr == 0x1f1, "The setup header has the wrong offset!");
|
||||
@ -631,12 +631,12 @@ Please note that English is not my first language, and I am really sorry for any
|
||||
Links
|
||||
-----------------
|
||||
|
||||
* [Book about Linux kernel insides](http://0xax.gitbooks.io/linux-insides/content/)
|
||||
* [Book about Linux kernel insides](https://0xax.gitbooks.io/linux-insides/content/)
|
||||
* [linker](https://en.wikipedia.org/wiki/Linker_%28computing%29)
|
||||
* [object files](https://en.wikipedia.org/wiki/Object_file)
|
||||
* [glibc](https://en.wikipedia.org/wiki/GNU_C_Library)
|
||||
* [opcode](https://en.wikipedia.org/wiki/Opcode)
|
||||
* [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format)
|
||||
* [GNU linker](https://en.wikipedia.org/wiki/GNU_linker)
|
||||
* [My posts about assembly programming for x86_64](http://0xax.github.io/categories/assembly/)
|
||||
* [My posts about assembly programming for x86_64](https://0xax.github.io/categories/assembler/)
|
||||
* [readelf](https://sourceware.org/binutils/docs/binutils/readelf.html)
|
@ -6,7 +6,7 @@ Introduction
|
||||
|
||||
Although [linux-insides](https://www.gitbook.com/book/0xax/linux-insides/details) mostly describes Linux kernel related stuff, I have decided to write this one part, which is mostly related to userspace.
|
||||
|
||||
There is already fourth [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html) of [System calls](https://en.wikipedia.org/wiki/System_call) chapter which describes what does the Linux kernel do when we want to start a program. In this part I want to explore what happens when we run a program on a Linux machine from userspace perspective.
|
||||
There is already a fourth [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-4.html) of the [System calls](https://en.wikipedia.org/wiki/System_call) chapter, which describes what the Linux kernel does when we want to start a program. In this part I want to explore what happens when we run a program on a Linux machine from the userspace perspective.
|
||||
|
||||
I don't know about you, but at my university I learned that a `C` program starts executing from the function called `main`. And that's partly true. Whenever we start to write a new program, we begin with the following lines of code:
|
||||
|
||||
@ -123,7 +123,7 @@ Ok, everything looks pretty good up to now. You may already know that there is a
|
||||
|
||||
> The exec() family of functions replaces the current process image with a new process image.
|
||||
|
||||
All the `exec*` functions are simple frontends to the [execve](http://man7.org/linux/man-pages/man2/execve.2.html) system call. If you have read the fourth [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html) of the chapter which describes [system calls](https://en.wikipedia.org/wiki/System_call), you may know that the [execve](http://linux.die.net/man/2/execve) system call is defined in the [files/exec.c](https://github.com/torvalds/linux/blob/08e4e0d0456d0ca8427b2d1ddffa30f1c3e774d7/fs/exec.c#L1888) source code file and looks like:
|
||||
All the `exec*` functions are simple frontends to the [execve](http://man7.org/linux/man-pages/man2/execve.2.html) system call. If you have read the fourth [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-4.html) of the chapter which describes [system calls](https://en.wikipedia.org/wiki/System_call), you may know that the [execve](http://linux.die.net/man/2/execve) system call is defined in the [fs/exec.c](https://github.com/torvalds/linux/blob/08e4e0d0456d0ca8427b2d1ddffa30f1c3e774d7/fs/exec.c#L1888) source code file and looks like:
|
||||
|
||||
```C
|
||||
SYSCALL_DEFINE3(execve,
|
||||
@ -135,7 +135,7 @@ SYSCALL_DEFINE3(execve,
|
||||
}
|
||||
```
|
||||
|
||||
It takes an executable file name, set of command line arguments, and set of enviroment variables. As you may guess, everything is done by the `do_execve` function. I will not describe the implementation of the `do_execve` function in detail because you can read about this in [here](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html). But in short words, the `do_execve` function does many checks like `filename` is valid, limit of launched processes is not exceed in our system and etc. After all of these checks, this function parses our executable file which is represented in [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) format, creates memory descriptor for newly executed executable file and fills it with the appropriate values like area for the stack, heap and etc. When the setup of new binary image is done, the `start_thread` function will set up one new process. This function is architecture-specific and for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, its definition will be located in the [arch/x86/kernel/process_64.c](https://github.com/torvalds/linux/blob/08e4e0d0456d0ca8427b2d1ddffa30f1c3e774d7/arch/x86/kernel/process_64.c#L239) source code file.
|
||||
It takes an executable file name, a set of command line arguments, and a set of environment variables. As you may guess, everything is done by the `do_execve` function. I will not describe the implementation of the `do_execve` function in detail because you can read about this [here](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-4.html). In short, the `do_execve` function does many checks, for example that `filename` is valid and that the limit of launched processes is not exceeded in our system. After all of these checks, this function parses our executable file, which is represented in the [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) format, creates a memory descriptor for the newly executed file and fills it with the appropriate values, like the areas for the stack and the heap. When the setup of the new binary image is done, the `start_thread` function will set up one new process. This function is architecture-specific and for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, its definition is located in the [arch/x86/kernel/process_64.c](https://github.com/torvalds/linux/blob/08e4e0d0456d0ca8427b2d1ddffa30f1c3e774d7/arch/x86/kernel/process_64.c#L239) source code file.
|
||||
|
||||
The `start_thread` function sets new values for the [segment registers](https://en.wikipedia.org/wiki/X86_memory_segmentation) and the program execution address. From this point, our new process is ready to start. Once the [context switch](https://en.wikipedia.org/wiki/Context_switch) is done, control will be returned to userspace with the new register values and the new executable will start to execute.
|
||||
|
||||
@ -150,7 +150,7 @@ In the previous paragraph we saw how an executable file is prepared to run by th
|
||||
$ gcc -Wall program.c -o sum
|
||||
```
|
||||
|
||||
You may guess that `_start` comes from the [stanard libray](https://en.wikipedia.org/wiki/Standard_library) and that's true. If you try to compile our program again and pass the `-v` option to gcc which will enable `verbose mode`, you will see a long output. The full output is not interesting for us, let's look at the following steps:
|
||||
You may guess that `_start` comes from the [standard library](https://en.wikipedia.org/wiki/Standard_library) and that's true. If you try to compile our program again and pass the `-v` option to gcc, which enables `verbose mode`, you will see a long output. The full output is not interesting for us, so let's look only at the following steps:
|
||||
|
||||
First of all, our program should be compiled with `gcc`:
|
||||
|
12
README.md
@ -3,7 +3,7 @@ linux-insides
|
||||
|
||||
A book-in-progress about the linux kernel and its insides.
|
||||
|
||||
**The goal is simple** - to share my modest knowledge about the insides of the linux kernel and help people who are interested in linux kernel insides, and other low-level subject matter.
|
||||
**The goal is simple** - to share my modest knowledge about the insides of the linux kernel and help people who are interested in linux kernel insides, and other low-level subject matter. Feel free to go through the book: [Start here](https://github.com/0xAX/linux-insides/blob/master/SUMMARY.md)
|
||||
|
||||
**Questions/Suggestions**: Feel free to send any questions or suggestions by pinging me on twitter [@0xAX](https://twitter.com/0xAX), adding an [issue](https://github.com/0xAX/linux-insides/issues/new) or just dropping me an [email](mailto:anotherworldofworld@gmail.com).
|
||||
|
||||
@ -22,11 +22,6 @@ On other languages
|
||||
* [Russian](https://github.com/proninyaroslav/linux-insides-ru)
|
||||
* [Spanish](https://github.com/leolas95/linux-insides)
|
||||
* [Turkish](https://github.com/ayyucedemirbas/linux-insides_Turkish)
|
||||
|
||||
LICENSE
|
||||
-------------
|
||||
|
||||
Licensed [BY-NC-SA Creative Commons](http://creativecommons.org/licenses/by-nc-sa/4.0/).
|
||||
|
||||
Contributions
|
||||
--------------
|
||||
@ -41,3 +36,8 @@ Author
|
||||
---------------
|
||||
|
||||
[@0xAX](https://twitter.com/0xAX)
|
||||
|
||||
LICENSE
|
||||
-------------
|
||||
|
||||
Licensed [BY-NC-SA Creative Commons](http://creativecommons.org/licenses/by-nc-sa/4.0/).
|
||||
|
101
SUMMARY.md
@ -6,6 +6,7 @@
|
||||
* [Video mode initialization and transition to protected mode](Booting/linux-bootstrap-3.md)
|
||||
* [Transition to 64-bit mode](Booting/linux-bootstrap-4.md)
|
||||
* [Kernel decompression](Booting/linux-bootstrap-5.md)
|
||||
* [Kernel load address randomization](Booting/linux-bootstrap-6.md)
|
||||
* [Initialization](Initialization/README.md)
|
||||
* [First steps in the kernel](Initialization/linux-initialization-1.md)
|
||||
* [Early interrupts handler](Initialization/linux-initialization-2.md)
|
||||
@ -17,73 +18,73 @@
|
||||
* [Scheduler initialization](Initialization/linux-initialization-8.md)
|
||||
* [RCU initialization](Initialization/linux-initialization-9.md)
|
||||
* [End of initialization](Initialization/linux-initialization-10.md)
|
||||
* [Interrupts](interrupts/README.md)
|
||||
* [Introduction](interrupts/interrupts-1.md)
|
||||
* [Start to dive into interrupts](interrupts/interrupts-2.md)
|
||||
* [Interrupt handlers](interrupts/interrupts-3.md)
|
||||
* [Initialization of non-early interrupt gates](interrupts/interrupts-4.md)
|
||||
* [Implementation of some exception handlers](interrupts/interrupts-5.md)
|
||||
* [Handling Non-Maskable interrupts](interrupts/interrupts-6.md)
|
||||
* [Dive into external hardware interrupts](interrupts/interrupts-7.md)
|
||||
* [Initialization of external hardware interrupts structures](interrupts/interrupts-8.md)
|
||||
* [Softirq, Tasklets and Workqueues](interrupts/interrupts-9.md)
|
||||
* [Last part](interrupts/interrupts-10.md)
|
||||
* [Interrupts](Interrupts/README.md)
|
||||
* [Introduction](Interrupts/linux-interrupts-1.md)
|
||||
* [Start to dive into interrupts](Interrupts/linux-interrupts-2.md)
|
||||
* [Interrupt handlers](Interrupts/linux-interrupts-3.md)
|
||||
* [Initialization of non-early interrupt gates](Interrupts/linux-interrupts-4.md)
|
||||
* [Implementation of some exception handlers](Interrupts/linux-interrupts-5.md)
|
||||
* [Handling Non-Maskable interrupts](Interrupts/linux-interrupts-6.md)
|
||||
* [Dive into external hardware interrupts](Interrupts/linux-interrupts-7.md)
|
||||
* [Initialization of external hardware interrupts structures](Interrupts/linux-interrupts-8.md)
|
||||
* [Softirq, Tasklets and Workqueues](Interrupts/linux-interrupts-9.md)
|
||||
* [Last part](Interrupts/linux-interrupts-10.md)
|
||||
* [System calls](SysCall/README.md)
|
||||
* [Introduction to system calls](SysCall/syscall-1.md)
|
||||
* [How the Linux kernel handles a system call](SysCall/syscall-2.md)
|
||||
* [vsyscall and vDSO](SysCall/syscall-3.md)
|
||||
* [How the Linux kernel runs a program](SysCall/syscall-4.md)
|
||||
* [Implementation of the open system call](SysCall/syscall-5.md)
|
||||
* [Limits on resources in Linux](SysCall/syscall-6.md)
|
||||
* [Introduction to system calls](SysCall/linux-syscall-1.md)
|
||||
* [How the Linux kernel handles a system call](SysCall/linux-syscall-2.md)
|
||||
* [vsyscall and vDSO](SysCall/linux-syscall-3.md)
|
||||
* [How the Linux kernel runs a program](SysCall/linux-syscall-4.md)
|
||||
* [Implementation of the open system call](SysCall/linux-syscall-5.md)
|
||||
* [Limits on resources in Linux](SysCall/linux-syscall-6.md)
|
||||
* [Timers and time management](Timers/README.md)
|
||||
* [Introduction](Timers/timers-1.md)
|
||||
* [Clocksource framework](Timers/timers-2.md)
|
||||
* [The tick broadcast framework and dyntick](Timers/timers-3.md)
|
||||
* [Introduction to timers](Timers/timers-4.md)
|
||||
* [Clockevents framework](Timers/timers-5.md)
|
||||
* [x86 related clock sources](Timers/timers-6.md)
|
||||
* [Time related system calls](Timers/timers-7.md)
|
||||
* [Introduction](Timers/linux-timers-1.md)
|
||||
* [Clocksource framework](Timers/linux-timers-2.md)
|
||||
* [The tick broadcast framework and dyntick](Timers/linux-timers-3.md)
|
||||
* [Introduction to timers](Timers/linux-timers-4.md)
|
||||
* [Clockevents framework](Timers/linux-timers-5.md)
|
||||
* [x86 related clock sources](Timers/linux-timers-6.md)
|
||||
* [Time related system calls](Timers/linux-timers-7.md)
|
||||
* [Synchronization primitives](SyncPrim/README.md)
|
||||
* [Introduction to spinlocks](SyncPrim/sync-1.md)
|
||||
* [Queued spinlocks](SyncPrim/sync-2.md)
|
||||
* [Semaphores](SyncPrim/sync-3.md)
|
||||
* [Mutex](SyncPrim/sync-4.md)
|
||||
* [Reader/Writer semaphores](SyncPrim/sync-5.md)
|
||||
* [SeqLock](SyncPrim/sync-6.md)
|
||||
* [Introduction to spinlocks](SyncPrim/linux-sync-1.md)
|
||||
* [Queued spinlocks](SyncPrim/linux-sync-2.md)
|
||||
* [Semaphores](SyncPrim/linux-sync-3.md)
|
||||
* [Mutex](SyncPrim/linux-sync-4.md)
|
||||
* [Reader/Writer semaphores](SyncPrim/linux-sync-5.md)
|
||||
* [SeqLock](SyncPrim/linux-sync-6.md)
|
||||
* [RCU]()
|
||||
* [Lockdep]()
|
||||
* [Memory management](mm/README.md)
|
||||
* [Memblock](mm/linux-mm-1.md)
|
||||
* [Fixmaps and ioremap](mm/linux-mm-2.md)
|
||||
* [kmemcheck](mm/linux-mm-3.md)
|
||||
* [Memory management](MM/README.md)
|
||||
* [Memblock](MM/linux-mm-1.md)
|
||||
* [Fixmaps and ioremap](MM/linux-mm-2.md)
|
||||
* [kmemcheck](MM/linux-mm-3.md)
|
||||
* [Cgroups](Cgroups/README.md)
|
||||
* [Introduction to Control Groups](Cgroups/cgroups1.md)
|
||||
* [Introduction to Control Groups](Cgroups/linux-cgroups-1.md)
|
||||
* [SMP]()
|
||||
* [Concepts](Concepts/README.md)
|
||||
* [Per-CPU variables](Concepts/per-cpu.md)
|
||||
* [Cpumasks](Concepts/cpumask.md)
|
||||
* [The initcall mechanism](Concepts/initcall.md)
|
||||
* [Notification Chains](Concepts/notification_chains.md)
|
||||
* [Per-CPU variables](Concepts/linux-cpu-1.md)
|
||||
* [Cpumasks](Concepts/linux-cpu-2.md)
|
||||
* [The initcall mechanism](Concepts/linux-cpu-3.md)
|
||||
* [Notification Chains](Concepts/linux-cpu-4.md)
|
||||
* [Data Structures in the Linux Kernel](DataStructures/README.md)
|
||||
* [Doubly linked list](DataStructures/dlist.md)
|
||||
* [Radix tree](DataStructures/radix-tree.md)
|
||||
* [Bit arrays](DataStructures/bitmap.md)
|
||||
* [Doubly linked list](DataStructures/linux-datastructures-1.md)
|
||||
* [Radix tree](DataStructures/linux-datastructures-2.md)
|
||||
* [Bit arrays](DataStructures/linux-datastructures-3.md)
|
||||
* [Theory](Theory/README.md)
|
||||
* [Paging](Theory/Paging.md)
|
||||
* [Elf64](Theory/ELF.md)
|
||||
* [Inline assembly](Theory/asm.md)
|
||||
* [Paging](Theory/linux-theory-1.md)
|
||||
* [Elf64](Theory/linux-theory-2.md)
|
||||
* [Inline assembly](Theory/linux-theory-3.md)
|
||||
* [CPUID]()
|
||||
* [MSR]()
|
||||
* [Initial ram disk]()
|
||||
* [initrd]()
|
||||
* [Misc](Misc/README.md)
|
||||
* [How the kernel is compiled](Misc/how_kernel_compiled.md)
|
||||
* [Linkers](Misc/linkers.md)
|
||||
* [Linux kernel development](Misc/contribute.md)
|
||||
* [Program startup process in userspace](Misc/program_startup.md)
|
||||
* [Linux kernel development](Misc/linux-misc-1.md)
|
||||
* [How the kernel is compiled](Misc/linux-misc-2.md)
|
||||
* [Linkers](Misc/linux-misc-3.md)
|
||||
* [Program startup process in userspace](Misc/linux-misc-4.md)
|
||||
* [Write and Submit your first Linux kernel Patch]()
|
||||
* [Data types in the kernel]()
|
||||
* [KernelStructures](KernelStructures/README.md)
|
||||
* [IDT](KernelStructures/idt.md)
|
||||
* [IDT](KernelStructures/linux-kernelstructure-1.md)
|
||||
* [Useful links](LINKS.md)
|
||||
* [Contributors](contributors.md)
|
||||
|
BIN
Scripts/LinuxKernelInsides.pdf
Normal file
Binary file not shown.
21
Scripts/README.md
Normal file
@ -0,0 +1,21 @@
|
||||
# Scripts
|
||||
|
||||
## Description
|
||||
|
||||
`get_all_links.py` : checks whether each link found in the Markdown files is live or dead (requires a network connection)
|
||||
|
||||
`latex.sh` : a script for converting Markdown files in each of the subdirectories into a unified PDF typeset in LaTeX
|
||||
|
||||
## Usage
|
||||
|
||||
`get_all_links.py` :
|
||||
|
||||
```
|
||||
./get_all_links.py ../
|
||||
```
|
||||
|
||||
`latex.sh` :
|
||||
|
||||
```
|
||||
./latex.sh
|
||||
```
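As a rough illustration of the link-gathering step in `get_all_links.py`, the sketch below pulls link targets out of Markdown text and separates remote URLs from relative file links. It is a simplified stand-in, not the script itself: it uses a plain regular expression instead of rendering through the `markdown` package, and the names `LINK_RE`, `extract_links` and `split_links` are hypothetical.

```python
import re

# Matches [text](target) and captures the target; empty targets such as
# "[RCU]()" are skipped because at least one character is required.
LINK_RE = re.compile(r'\[[^\]]*\]\(([^)\s]+)\)')


def extract_links(text):
    """Return the unique link targets in `text`, in order of first appearance."""
    seen = []
    for target in LINK_RE.findall(text):
        if target not in seen:
            seen.append(target)
    return seen


def split_links(targets):
    """Split targets into (remote URLs, relative file links)."""
    remote = [t for t in targets if t.startswith("http")]
    local = [t for t in targets if not t.startswith("http")]
    return remote, local


sample = "\n".join([
    "* [Paging](Theory/linux-theory-1.md)",
    "* [page fault](https://en.wikipedia.org/wiki/Page_fault)",
    "* [RCU]()",
    "* [page fault](https://en.wikipedia.org/wiki/Page_fault)",
])

remote, local = split_links(extract_links(sample))
print(remote)
print(local)
```

Only the remote URLs would need a network check; the relative targets are file names inside the repository.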
|
77
Scripts/get_all_links.py
Executable file
@ -0,0 +1,77 @@
|
||||
#!/usr/bin/env python

from __future__ import print_function
from socket import timeout

import os
import sys
import codecs
import re

import markdown

try:
    # compatible for python2
    from urllib2 import urlopen
    from urllib2 import HTTPError
    from urllib2 import URLError
except ImportError:
    # compatible for python3
    from urllib.request import urlopen
    from urllib.error import HTTPError
    from urllib.error import URLError

def check_live_url(url):

    result = False
    try:
        ret = urlopen(url, timeout=2)
        result = (ret.code == 200)
    except HTTPError as e:
        print(e, file=sys.stderr)
    except URLError as e:
        print(e, file=sys.stderr)
    except timeout as e:
        print(e, file=sys.stderr)
    except Exception as e:
        print(e, file=sys.stderr)

    return result


def main(path):

    filenames = []
    for (dirpath, dnames, fnames) in os.walk(path):
        for fname in fnames:
            if fname.endswith('.md'):
                filenames.append(os.sep.join([dirpath, fname]))

    urls = []

    for filename in filenames:
        fd = codecs.open(filename, mode="r", encoding="utf-8")
        for line in fd.readlines():
            refs = re.findall(r'(?<=<a href=")[^"]*', markdown.markdown(line))
            for ref in refs:
                if ref not in urls:
                    urls.append(ref)

    #print(len(urls))

    for url in urls:
        if not url.startswith("http"):
            print("markdown file name: " + url)
            continue
        if check_live_url(url):
            print(url)
        else:
            print(url, file=sys.stderr)


if __name__ == '__main__':

    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("Choose one path as argument one")
|
26
Scripts/latex.sh
Executable file
@ -0,0 +1,26 @@
|
||||
#!/bin/bash
# latex.sh
# A script for converting Markdown files in each of the subdirectories into a unified PDF typeset in LaTeX.
# Requires TexLive, Pandoc templates and pdfunite. Not necessary if you just want to read the PDF, only if you're compiling it yourself.

rm -rf build
mkdir build
for D in $(ls ../); do
    if [ -d "../${D}" ]
    then
        echo "Converting $D . . ."
        pandoc ../$D/README.md ../$D/linux-*.md -o build/$D.tex --template default
    fi
done

cd ./build
for f in *.tex
do
    pdflatex -interaction=nonstopmode "$f"
done

cd ../
pandoc ../README.md ../SUMMARY.md ../CONTRIBUTING.md ../contributors.md \
    -o ./build/Preface.tex --template default

pdfunite ./build/*.pdf LinuxKernelInsides.pdf
|
@ -2,9 +2,9 @@

This chapter describes synchronization primitives in the Linux kernel.

* [Introduction to spinlocks](sync-1.md) - the first part of this chapter describes implementation of spinlock mechanism in the Linux kernel.
* [Queued spinlocks](sync-2.md) - the second part describes another type of spinlocks - queued spinlocks.
* [Semaphores](sync-3.md) - this part describes implementation of `semaphore` synchronization primitive in the Linux kernel.
* [Mutual exclusion](sync-4.md) - this part describes - `mutex` in the Linux kernel.
* [Reader/Writer semaphores](sync-5.md) - this part describes special type of semaphores - `reader/writer` semaphores.
* [Sequential locks](sync-6.md) - this part describes sequential locks in the Linux kernel.
* [Introduction to spinlocks](linux-sync-1.md) - the first part of this chapter describes implementation of spinlock mechanism in the Linux kernel.
* [Queued spinlocks](linux-sync-2.md) - the second part describes another type of spinlocks - queued spinlocks.
* [Semaphores](linux-sync-3.md) - this part describes implementation of `semaphore` synchronization primitive in the Linux kernel.
* [Mutual exclusion](linux-sync-4.md) - this part describes - `mutex` in the Linux kernel.
* [Reader/Writer semaphores](linux-sync-5.md) - this part describes special type of semaphores - `reader/writer` semaphores.
* [Sequential locks](linux-sync-6.md) - this part describes sequential locks in the Linux kernel.
@ -4,9 +4,9 @@ Synchronization primitives in the Linux kernel. Part 1.
|
||||
Introduction
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This part opens new chapter in the [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book. Timers and time management related stuff was described in the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html). Now time to go next. As you may understand from the part's title, this chapter will describe [synchronization](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) primitives in the Linux kernel.
|
||||
This part opens a new chapter in the [linux-insides](https://0xax.gitbooks.io/linux-insides/content/) book. Timers and time management related stuff was described in the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html). Now time to go next. As you may understand from the part's title, this chapter will describe [synchronization](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) primitives in the Linux kernel.
|
||||
|
||||
As always, before we will consider something synchronization related, we will try to know what is `synchronization primitive` in general. Actually, synchronization primitive is a software mechanism which provides ability to two or more [parallel](https://en.wikipedia.org/wiki/Parallel_computing) processes or threads to not execute simultaneously on the same segment of a code. For example let's look on the following piece of code:
|
||||
As always, before we will consider something synchronization related, we will try to know what is `synchronization primitive` in general. Actually, synchronization primitive is a software mechanism which provides the ability to two or more [parallel](https://en.wikipedia.org/wiki/Parallel_computing) processes or threads to not execute simultaneously on the same segment of a code. For example, let's look on the following piece of code:
|
||||
|
||||
```C
|
||||
mutex_lock(&clocksource_mutex);
|
||||
@ -22,9 +22,9 @@ clocksource_select();
|
||||
mutex_unlock(&clocksource_mutex);
|
||||
```
|
||||
|
||||
from the [kernel/time/clocksource.c](https://github.com/torvalds/linux/master/kernel/time/clocksource.c) source code file. This code is from the `__clocksource_register_scale` function which adds the given [clocksource](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) to the clock sources list. This function produces different operations on a list with registered clock sources. For example the `clocksource_enqueue` function adds the given clock source to the list with registered clocksources - `clocksource_list`. Note that these lines of code wrapped to two functions: `mutex_lock` and `mutex_unlock` which are takes one parameter - the `clocksource_mutex` in our case.
|
||||
from the [kernel/time/clocksource.c](https://github.com/torvalds/linux/master/kernel/time/clocksource.c) source code file. This code is from the `__clocksource_register_scale` function which adds the given [clocksource](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-2.html) to the clock sources list. This function produces different operations on a list with registered clock sources. For example, the `clocksource_enqueue` function adds the given clock source to the list with registered clocksources - `clocksource_list`. Note that these lines of code wrapped to two functions: `mutex_lock` and `mutex_unlock` which takes one parameter - the `clocksource_mutex` in our case.
|
||||
|
||||
These functions represents locking and unlocking based on [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) synchronization primitive. As `mutex_lock` will be executed, it allows us to prevent situation when two or more threads will execute this code while the `mute_unlock` will not be executed by process-owner of the mutex. In other words, we prevent parallel operations on a `clocksource_list`. Why do we need `mutex` here? What if two parallel processes will try to register a clock source. As we already know, the `clocksource_enqueue` function adds the given clock source to the `clocksource_list` list right after a clock source in the list which has the biggest rating (a registered clock source which has the highest frequency in the system):
|
||||
These functions represent locking and unlocking based on [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) synchronization primitive. As `mutex_lock` will be executed, it allows us to prevent the situation when two or more threads will execute this code while the `mutex_unlock` will not be executed by process-owner of the mutex. In other words, we prevent parallel operations on a `clocksource_list`. Why do we need `mutex` here? What if two parallel processes will try to register a clock source. As we already know, the `clocksource_enqueue` function adds the given clock source to the `clocksource_list` list right after a clock source in the list which has the biggest rating (a registered clock source which has the highest frequency in the system):
|
||||
|
||||
```C
|
||||
static void clocksource_enqueue(struct clocksource *cs)
|
||||
@ -39,7 +39,7 @@ static void clocksource_enqueue(struct clocksource *cs)
|
||||
}
|
||||
```
|
||||
|
||||
If two parallel processes will try to do it simultaneously, both process may found the same `entry` may occur [race condition](https://en.wikipedia.org/wiki/Race_condition) or in other words, the second process which will execute `list_add`, will overwrite a clock source from first thread.
|
||||
If two parallel processes will try to do it simultaneously, both process may found the same `entry` may occur [race condition](https://en.wikipedia.org/wiki/Race_condition) or in other words, the second process which will execute `list_add`, will overwrite a clock source from the first thread.
|
||||
|
||||
Besides this simple example, synchronization primitives are ubiquitous in the Linux kernel. If we will go through the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) or other chapters again or if we will look at the Linux kernel source code in general, we will meet many places like this. We will not consider how `mutex` is implemented in the Linux kernel. Actually, the Linux kernel provides a set of different synchronization primitives like:
|
||||
|
||||
@ -87,7 +87,7 @@ typedef struct spinlock {
|
||||
} spinlock_t;
|
||||
```
|
||||
|
||||
The `raw_spinlock` structure defined in the [same](https://github.com/torvalds/linux/master/include/linux/spinlock_types.h) header file and represents implementation of `normal` spinlock. Let's look how the `raw_spinlock` structure is defined:
|
||||
The `raw_spinlock` structure defined in the [same](https://github.com/torvalds/linux/master/include/linux/spinlock_types.h) header file and represents the implementation of `normal` spinlock. Let's look how the `raw_spinlock` structure is defined:
|
||||
|
||||
```C
|
||||
typedef struct raw_spinlock {
|
||||
@ -307,13 +307,13 @@ static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
|
||||
}
|
||||
```
|
||||
|
||||
The `__acquire` here is just [sparse](https://en.wikipedia.org/wiki/Sparse) related macro and we are not interesting in it in this moment. Location of the definition of the `arch_spin_lock` function depends on two things: the first is architecture of system and the second do we use `queued spinlocks` or not. In our case we consider only `x86_64` architecture, so the definition of the `arch_spin_lock` is represented as the macro from the [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/asm-generic/qspinlocks.h) header file:
|
||||
The `__acquire` here is just [sparse](https://en.wikipedia.org/wiki/Sparse) related macro and we are not interested in it in this moment. Location of the definition of the `arch_spin_lock` function depends on two things: the first is the architecture of the system and the second do we use `queued spinlocks` or not. In our case we consider only `x86_64` architecture, so the definition of the `arch_spin_lock` is represented as the macro from the [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/asm-generic/qspinlocks.h) header file:
|
||||
|
||||
```C
|
||||
#define arch_spin_lock(l) queued_spin_lock(l)
|
||||
```
|
||||
|
||||
if we are using `queued spinlocks`. Or in other case, the `arch_spin_lock` function is defined in the [arch/x86/include/asm/spinlock.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/spinlock.h) header file. Now we will consider only `normal spinlock` and information related to `queued spinlocks` we will see later. Let's look again on the definition of the `arch_spinlock` structure, to understand implementation of the `arch_spin_lock` function:
|
||||
if we are using `queued spinlocks`. Or in other case, the `arch_spin_lock` function is defined in the [arch/x86/include/asm/spinlock.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/spinlock.h) header file. Now we will consider only `normal spinlock` and information related to `queued spinlocks` we will see later. Let's look again on the definition of the `arch_spinlock` structure, to understand the implementation of the `arch_spin_lock` function:
|
||||
|
||||
```C
|
||||
typedef struct arch_spinlock {
|
||||
@ -362,9 +362,9 @@ At the beginning of the `arch_spin_lock` function we can initialization of the `
|
||||
#define __TICKET_LOCK_INC 1
|
||||
```
|
||||
|
||||
In the next line we execute [xadd](http://x86.renejeschke.de/html/file_module_x86_id_327.html) operation on the `inc` and `lock->tickets`. After this operation the `inc` will store value of the `tickets` of the given `lock` and the `tickets.tail` will be increased on `inc` or `1`. The `tail` value was increased on `1` which means that one process started to try to hold a lock. In the next step we do the check that checks that `head` and `tail` have the same value. If these values are equal, this means that nobody holds lock and we go to the `out` label. In the end of the `arch_spin_lock` function we may see the `barrier` macro which represents `barrier instruction` which guarantees that compiler will not change order of operations that access memory (more about memory barriers you can read in the kernel [documentation](https://www.kernel.org/doc/Documentation/memory-barriers.txt)).
|
||||
In the next line we execute [xadd](http://x86.renejeschke.de/html/file_module_x86_id_327.html) operation on the `inc` and `lock->tickets`. After this operation the `inc` will store value of the `tickets` of the given `lock` and the `tickets.tail` will be increased on `inc` or `1`. The `tail` value was increased on `1` which means that one process started to try to hold a lock. In the next step we do the check that checks that `head` and `tail` have the same value. If these values are equal, this means that nobody holds lock and we go to the `out` label. In the end of the `arch_spin_lock` function we may see the `barrier` macro which represents `barrier instruction` which guarantees that compiler will not change the order of operations that access memory (more about memory barriers you can read in the kernel [documentation](https://www.kernel.org/doc/Documentation/memory-barriers.txt)).
|
||||
|
||||
If one process held a lock and a second process started to execute the `arch_spin_lock` function, the `head` will not be `equal` to `tail`, because the `tail` will be greater than `head` on `1`. In this way, process will occur in the loop. There will be comparison between `head` and the `tail` values at each loop iteration. If these values are not equal, the `cpu_relax` will be called which is just [NOP](https://en.wikipedia.org/wiki/NOP) instruction:
|
||||
If one process held a lock and a second process started to execute the `arch_spin_lock` function, the `head` will not be `equal` to `tail`, because the `tail` will be greater than `head` on `1`. In this way, the process will occur in the loop. There will be comparison between `head` and the `tail` values at each loop iteration. If these values are not equal, the `cpu_relax` will be called which is just [NOP](https://en.wikipedia.org/wiki/NOP) instruction:
|
||||
|
||||
|
||||
```C
|
||||
@ -417,7 +417,7 @@ Links

* [Concurrent computing](https://en.wikipedia.org/wiki/Concurrent_computing)
* [Synchronization](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29)
* [Clocksource framework](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html)
* [Clocksource framework](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-2.html)
* [Mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
* [Race condition](https://en.wikipedia.org/wiki/Race_condition)
* [Atomic operations](https://en.wikipedia.org/wiki/Linearizability)
@ -4,9 +4,9 @@ Synchronization primitives in the Linux kernel. Part 2.
|
||||
Queued Spinlocks
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the second part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html) which describes synchronization primitives in the Linux kernel and in the first [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) of this chapter we met the first - [spinlock](https://en.wikipedia.org/wiki/Spinlock). We will continue to learn this synchronization primitive in this part. If you have read the previous part, you may remember that besides normal spinlocks, the Linux kernel provides special type of `spinlocks` - `queued spinlocks`. In this part we will try to understand what does this concept represent.
|
||||
This is the second part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html) which describes synchronization primitives in the Linux kernel and in the first [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) of this chapter we met the first - [spinlock](https://en.wikipedia.org/wiki/Spinlock). We will continue to learn this synchronization primitive in this part. If you have read the previous part, you may remember that besides normal spinlocks, the Linux kernel provides special type of `spinlocks` - `queued spinlocks`. In this part we will try to understand what does this concept represent.
|
||||
|
||||
We saw [API](https://en.wikipedia.org/wiki/Application_programming_interface) of `spinlock` in the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html):
|
||||
We saw [API](https://en.wikipedia.org/wiki/Application_programming_interface) of `spinlock` in the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html):
|
||||
|
||||
* `spin_lock_init` - produces initialization of the given `spinlock`;
|
||||
* `spin_lock` - acquires given `spinlock`;
|
||||
@ -97,11 +97,11 @@ int unlock(lock)
|
||||
|
||||
The first thread will execute the `test_and_set` which will set the `lock` to `1`. When the second thread will call the `lock` function, it will spin in the `while` loop, until the first thread will not call the `unlock` function and the `lock` will be equal to `0`. This implementation is not very good for performance, because it has at least two problems. The first problem is that this implementation may be unfair and the thread from one processor may have long waiting time, even if it called the `lock` before other threads which are waiting for free lock too. The second problem is that all threads which want to acquire a lock, must to execute many `atomic` operations like `test_and_set` on a variable which is in shared memory. This leads to the cache invalidation as the cache of the processor will store `lock=1`, but the value of the `lock` in memory may be `1` after a thread will release this lock.
|
||||
|
||||
In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) we saw the second type of spinlock implementation - `ticket spinlock`. This approach solves the first problem and may guarantee order of threads which want to acquire a lock, but still has a second problem.
|
||||
In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) we saw the second type of spinlock implementation - `ticket spinlock`. This approach solves the first problem and may guarantee order of threads which want to acquire a lock, but still has a second problem.
|
||||
|
||||
The topic of this part is `queued spinlocks`. This approach may help to solve both of these problems. The `queued spinlocks` allows to each processor to use its own memory location to spin. The basic principle of a queue-based spinlock can best be understood by studying a classic queue-based spinlock implementation called the [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) lock. Before we will look at implementation of the `queued spinlocks` in the Linux kernel, we will try to understand what is it `MCS` lock.
|
||||
|
||||
The basic idea of the `MCS` lock is in that as I already wrote in the previous paragraph, a thread spins on a local variable and each processor in the system has its own copy of these variable. In other words this concept is built on top of the [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variables concept in the Linux kernel.
|
||||
The basic idea of the `MCS` lock is in that as I already wrote in the previous paragraph, a thread spins on a local variable and each processor in the system has its own copy of these variable. In other words this concept is built on top of the [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variables concept in the Linux kernel.
|
||||
|
||||
When the first thread wants to acquire a lock, it registers itself in the `queue` or in other words it will be added to the special `queue` and will acquire lock, because it is free for now. When the second thread will want to acquire the same lock before the first thread will release it, this thread adds its own copy of the lock variable into this `queue`. In this case the first thread will contain a `next` field which will point to the second thread. From this moment, the second thread will wait until the first thread will release its lock and notify `next` thread about this event. The first thread will be deleted from the `queue` and the second thread will be owner of a lock.
|
||||
|
||||
@ -462,7 +462,7 @@ That's all.
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the second part of the [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) chapter in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) we already met the first synchronization primitive `spinlock` provided by the Linux kernel which is implemented as `ticket spinlock`. In this part we saw another implementation of the `spinlock` mechanism - `queued spinlock`. In the next part we will continue to dive into synchronization primitives in the Linux kernel.
|
||||
This is the end of the second part of the [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) chapter in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) we already met the first synchronization primitive `spinlock` provided by the Linux kernel which is implemented as `ticket spinlock`. In this part we saw another implementation of the `spinlock` mechanism - `queued spinlock`. In the next part we will continue to dive into synchronization primitives in the Linux kernel.
|
||||
|
||||
If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
|
||||
|
||||
@ -477,11 +477,11 @@ Links
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [Test and Set](https://en.wikipedia.org/wiki/Test-and-set)
* [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf)
* [per-cpu variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [per-cpu variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html)
* [atomic instruction](https://en.wikipedia.org/wiki/Linearizability)
* [CMPXCHG instruction](http://x86.renejeschke.de/html/file_module_x86_id_41.html)
* [LOCK instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html)
* [NOP instruction](https://en.wikipedia.org/wiki/NOP)
* [PREFETCHW instruction](http://www.felixcloutier.com/x86/PREFETCHW.html)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html)
@ -4,7 +4,7 @@ Synchronization primitives in the Linux kernel. Part 3.
|
||||
Semaphores
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html) which describes synchronization primitives in the Linux kernel and in the previous part we saw special type of [spinlocks](https://en.wikipedia.org/wiki/Spinlock) - `queued spinlocks`. The previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-2.html) was the last part which describes `spinlocks` related stuff. So we need to go ahead.
|
||||
This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html) which describes synchronization primitives in the Linux kernel and in the previous part we saw special type of [spinlocks](https://en.wikipedia.org/wiki/Spinlock) - `queued spinlocks`. The previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-2.html) was the last part which describes `spinlocks` related stuff. So we need to go ahead.
|
||||
|
||||
The next [synchronization primitive](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) after `spinlock` which we will see in this part is [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29). We will start from theoretical side and will learn what is it `semaphore` and only after this, we will see how it is implemented in the Linux kernel as we did in the previous part.
|
||||
|
||||
@ -70,13 +70,13 @@ as we may see, the `DEFINE_SEMAPHORE` macro provides ability to initialize only
|
||||
}
|
||||
```
|
||||
|
||||
The `__SEMAPHORE_INITIALIZER` macro takes the name of the future `semaphore` structure and does initialization of the fields of this structure. First of all we initialize a `spinlock` of the given `semaphore` with the `__RAW_SPIN_LOCK_UNLOCKED` macro. As you may remember from the [previous](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) parts, the `__RAW_SPIN_LOCK_UNLOCKED` is defined in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/spinlock_types.h) header file and expands to the `__ARCH_SPIN_LOCK_UNLOCKED` macro which just expands to zero or unlocked state:
|
||||
The `__SEMAPHORE_INITIALIZER` macro takes the name of the future `semaphore` structure and does initialization of the fields of this structure. First of all we initialize a `spinlock` of the given `semaphore` with the `__RAW_SPIN_LOCK_UNLOCKED` macro. As you may remember from the [previous](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) parts, the `__RAW_SPIN_LOCK_UNLOCKED` is defined in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/spinlock_types.h) header file and expands to the `__ARCH_SPIN_LOCK_UNLOCKED` macro which just expands to zero or unlocked state:
|
||||
|
||||
```C
|
||||
#define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } }
|
||||
```
|
||||
|
||||
The last two fields of the `semaphore` structure `count` and `wait_list` are initialized with the given value which represents count of available resources and empty [list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html).
|
||||
The last two fields of the `semaphore` structure `count` and `wait_list` are initialized with the given value which represents count of available resources and empty [list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-1.html).
|
||||
|
||||
The second way to initialize a `semaphore` structure is to pass the `semaphore` and number of available resources to the `sema_init` function which is defined in the [include/linux/semaphore.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/semaphore.h) header file:
|
||||
|
||||
@ -106,7 +106,7 @@ The first two functions: `down` and `up` are for acquiring and releasing of the
|
||||
|
||||
The `down_killable` function does the same as the `down_interruptible` function, but set the `TASK_KILLABLE` flag for the current process. This means that the waiting process may be interrupted by the kill signal.
|
||||
|
||||
The `down_trylock` function is similar on the `spin_trylock` function. This function tries to acquire a lock and exit if this operation was unsuccessful. In this case the process which wants to acquire a lock, will not wait. The last `down_timeout` function tries to acquire a lock. It will be interrupted in a waiting state when the given timeout will be expired. Additionally, you may notice that the timeout is in [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html)
|
||||
The `down_trylock` function is similar on the `spin_trylock` function. This function tries to acquire a lock and exit if this operation was unsuccessful. In this case the process which wants to acquire a lock, will not wait. The last `down_timeout` function tries to acquire a lock. It will be interrupted in a waiting state when the given timeout will be expired. Additionally, you may notice that the timeout is in [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html)
|
||||
|
||||
We just saw definitions of the `semaphore` [API](https://en.wikipedia.org/wiki/Application_programming_interface). We will start from the `down` function. This function is defined in the [kernel/locking/semaphore.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/locking/semaphore.c) source code file. Let's look on the implementation function:
|
||||
|
||||
@ -184,7 +184,7 @@ The first represents current task for the local processor which wants to acquire
|
||||
#define current get_current()
|
||||
```
|
||||
|
||||
Where the `get_current` function returns value of the `current_task` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable:
|
||||
Where the `get_current` function returns value of the `current_task` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variable:
|
||||
|
||||
```C
|
||||
DECLARE_PER_CPU(struct task_struct *, current_task);
|
||||
@ -342,13 +342,13 @@ Links
* [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29)
* [deadlocks](https://en.wikipedia.org/wiki/Deadlock)
* [scheduler](https://en.wikipedia.org/wiki/Scheduling_%28computing%29)
* [Doubly linked list in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html)
* [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html)
* [Doubly linked list in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-1.html)
* [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html)
* [interrupts](https://en.wikipedia.org/wiki/Interrupt)
* [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html)
* [bitmask](https://en.wikipedia.org/wiki/Mask_%28computing%29)
* [SIGKILL](https://en.wikipedia.org/wiki/Unix_signal#SIGKILL)
* [errno](https://en.wikipedia.org/wiki/Errno.h)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-2.html)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-2.html)
@ -13,7 +13,7 @@ So, let's start.
|
||||
Concept of `mutex`
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
We already familiar with the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) synchronization primitive from the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html). It represented by the:
|
||||
We already familiar with the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) synchronization primitive from the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-3.html). It represented by the:
|
||||
|
||||
```C
|
||||
struct semaphore {
|
||||
@ -77,7 +77,7 @@ struct mutex_waiter {
|
||||
};
|
||||
```
|
||||
|
||||
structure from the [include/linux/mutex.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/mutex.h) header file and will be sleep. Before we will consider [API](https://en.wikipedia.org/wiki/Application_programming_interface) which is provided by the Linux kernel for manipulation with `mutexes`, let's consider the `mutex_waiter` structure. If you have read the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html) of this chapter, you may notice that the `mutex_waiter` structure is similar to the `semaphore_waiter` structure from the [kernel/locking/semaphore.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/locking/semaphore.c) source code file:
|
||||
structure from the [include/linux/mutex.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/mutex.h) header file and will be sleep. Before we will consider [API](https://en.wikipedia.org/wiki/Application_programming_interface) which is provided by the Linux kernel for manipulation with `mutexes`, let's consider the `mutex_waiter` structure. If you have read the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-3.html) of this chapter, you may notice that the `mutex_waiter` structure is similar to the `semaphore_waiter` structure from the [kernel/locking/semaphore.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/locking/semaphore.c) source code file:
|
||||
|
||||
```C
|
||||
struct semaphore_waiter {
|
||||
@ -114,7 +114,7 @@ macro. Let's consider implementation of this macro. As we may see, the `DEFINE_M
|
||||
}
|
||||
```
|
||||
|
||||
This macro is defined in the [same](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/mutex.h) header file and, as we may understand, it initializes the fields of the `mutex` structure to their initial values. The `count` field is initialized with `1`, which represents the `unlocked` state of a mutex. The `wait_lock` [spinlock](https://en.wikipedia.org/wiki/Spinlock) is initialized to the unlocked state and the last field, `wait_list`, to an empty [doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html).
|
||||
This macro is defined in the [same](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/mutex.h) header file and, as we may understand, it initializes the fields of the `mutex` structure to their initial values. The `count` field is initialized with `1`, which represents the `unlocked` state of a mutex. The `wait_lock` [spinlock](https://en.wikipedia.org/wiki/Spinlock) is initialized to the unlocked state and the last field, `wait_list`, to an empty [doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-1.html).
|
||||
|
||||
The second approach allows us to initialize a `mutex` dynamically. To do this we need to call the `__mutex_init` function from the [kernel/locking/mutex.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/locking/mutex.c) source code file. Actually, the `__mutex_init` function is rarely called directly. Instead of the `__mutex_init`, the:
|
||||
|
||||
@ -176,7 +176,7 @@ We may see the call of the `might_sleep` macro from the [include/linux/kernel.h]
|
||||
|
||||
After the `might_sleep` macro, we may see the call of the `__mutex_fastpath_lock` function. This function is architecture-specific and as we consider [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture in this book, the implementation of the `__mutex_fastpath_lock` is located in the [arch/x86/include/asm/mutex_64.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/mutex_64.h) header file. As we may understand from the name of the `__mutex_fastpath_lock` function, this function will try to acquire lock in a fast path or in other words this function will try to decrement the value of the `count` of the given mutex.
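The fast path idea described above can be modelled in userspace with C11 atomics. This is only a minimal sketch: the `toy_mutex` type and the `toy_*` helpers are hypothetical names, not kernel code, and the meaning of `0` and negative `count` values is an assumption based on the kernel of that era. The unlock helper is deliberately simplified and does not wake any waiters.

```C
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of the `count` field, not the kernel's struct mutex. */
struct toy_mutex {
	/* 1: unlocked; assumed: 0: locked, negative: locked with possible waiters */
	atomic_int count;
};

static void toy_mutex_init(struct toy_mutex *m)
{
	atomic_init(&m->count, 1);	/* same initial value that DEFINE_MUTEX uses */
}

/*
 * Atomically decrement `count`. A non-negative result means the lock was
 * free and the fast path succeeded; a negative result means a real
 * implementation would have to fall back to a slow path.
 */
static bool toy_fastpath_lock(struct toy_mutex *m)
{
	int old = atomic_fetch_sub(&m->count, 1);

	return old - 1 >= 0;
}

/* Simplified unlock: put the counter back into the unlocked state. */
static void toy_fastpath_unlock(struct toy_mutex *m)
{
	assert(atomic_load(&m->count) <= 0);	/* must be locked */
	atomic_store(&m->count, 1);
}
```

In this model the first `toy_fastpath_lock` call on a fresh mutex returns `true`, and a second call made before `toy_fastpath_unlock` returns `false`, which is the point where the kernel would take the slow path.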
|
||||
|
||||
The implementation of the `__mutex_fastpath_lock` function consists of two parts. The first part is an [inline assembly](https://0xax.gitbooks.io/linux-insides/content/Theory/asm.html) statement. Let's look at it:
|
||||
The implementation of the `__mutex_fastpath_lock` function consists of two parts. The first part is an [inline assembly](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-3.html) statement. Let's look at it:
|
||||
|
||||
```C
|
||||
asm_volatile_goto(LOCK_PREFIX " decl %0\n"
|
||||
@ -403,7 +403,7 @@ That's all. We have considered main `API` for manipulation with `mutexes`: `mute
|
||||
* `mutex_lock_killable`;
|
||||
* `mutex_trylock`.
|
||||
|
||||
and corresponding versions of `unlock` prefixed functions. This part will not describe this `API`, because it is similar to the corresponding `API` of `semaphores`. You may read more about it in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html).
|
||||
and corresponding versions of `unlock` prefixed functions. This part will not describe this `API`, because it is similar to the corresponding `API` of `semaphores`. You may read more about it in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-3.html).
|
||||
|
||||
That's all.
|
||||
|
||||
@ -429,12 +429,12 @@ Links
|
||||
* [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
|
||||
* [Atomic](https://en.wikipedia.org/wiki/Linearizability)
|
||||
* [MCS lock](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf)
|
||||
* [Doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html)
|
||||
* [Doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-1.html)
|
||||
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
|
||||
* [Inline assembly](https://0xax.gitbooks.io/linux-insides/content/Theory/asm.html)
|
||||
* [Inline assembly](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-3.html)
|
||||
* [Memory barrier](https://en.wikipedia.org/wiki/Memory_barrier)
|
||||
* [Lock instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html)
|
||||
* [JNS instruction](http://unixwiz.net/techtips/x86-jumps.html)
|
||||
* [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29)
|
||||
* [Unix signals](https://en.wikipedia.org/wiki/Unix_signal)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-3.html)
|
@ -15,7 +15,7 @@ Reader/Writer semaphore
|
||||
|
||||
There are two types of operations that may be performed on data: we may read data and we may change data. These are the two fundamental operations, `read` and `write`. Usually (but not always), the `read` operation is performed more often than the `write` operation. In this case, it would be logical to lock data in such a way that several processes may read the locked data at the same time, on condition that no one changes the data. The [readers/writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) allows us to get this kind of lock.
|
||||
|
||||
When a process wants to write something into data, all other `writer` and `reader` processes will be blocked until the process which acquired the lock releases it. When a process reads data, other processes which want to read the same data will not be blocked and will be able to do so. As you may guess, the implementation of the `reader/writer semaphore` is based on the implementation of the `normal semaphore`. We are already familiar with the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) synchronization primitive from the third [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-4.html) of this chapter. From the theoretical side everything looks pretty simple. Let's look at how the `reader/writer semaphore` is represented in the Linux kernel.
|
||||
When a process wants to write something into data, all other `writer` and `reader` processes will be blocked until the process which acquired the lock releases it. When a process reads data, other processes which want to read the same data will not be blocked and will be able to do so. As you may guess, the implementation of the `reader/writer semaphore` is based on the implementation of the `normal semaphore`. We are already familiar with the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) synchronization primitive from the third [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-4.html) of this chapter. From the theoretical side everything looks pretty simple. Let's look at how the `reader/writer semaphore` is represented in the Linux kernel.
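The admission rule described above (many readers or exactly one writer) can be written down as a tiny state model. This is only an illustrative sketch: the `toy_rwlock` type and its helpers are hypothetical, they are not thread-safe, and they say nothing about how the kernel actually stores this state.

```C
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, single-threaded model of reader/writer admission rules. */
struct toy_rwlock {
	int  readers;	/* number of readers currently holding the lock */
	bool writer;	/* true while a writer holds the lock */
};

/* A reader is admitted as long as no writer holds the lock. */
static bool toy_try_read(struct toy_rwlock *l)
{
	if (l->writer)
		return false;
	l->readers++;
	return true;
}

/* A writer is admitted only when nobody at all holds the lock. */
static bool toy_try_write(struct toy_rwlock *l)
{
	if (l->writer || l->readers > 0)
		return false;
	l->writer = true;
	return true;
}

static void toy_read_unlock(struct toy_rwlock *l)
{
	assert(l->readers > 0);
	l->readers--;
}

static void toy_write_unlock(struct toy_rwlock *l)
{
	assert(l->writer);
	l->writer = false;
}
```

With this model two consecutive `toy_try_read` calls both succeed, while `toy_try_write` fails until both readers have called `toy_read_unlock`.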
|
||||
|
||||
The `semaphore` is represented by the:
|
||||
|
||||
@ -59,7 +59,7 @@ config RWSEM_GENERIC_SPINLOCK
|
||||
|
||||
So, as this [book](https://0xax.gitbooks.io/linux-insides/content) describes only [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture related stuff, we will skip the case when the `CONFIG_RWSEM_GENERIC_SPINLOCK` kernel configuration is enabled and consider definition of the `rw_semaphore` structure only from the [include/linux/rwsem.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/rwsem.h) header file.
|
||||
|
||||
If we take a look at the definition of the `rw_semaphore` structure, we will notice that the first three fields are the same as in the `semaphore` structure. It contains the `count` field which represents the amount of available resources, the `wait_list` field which represents a [doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html) of processes which are waiting to acquire a lock and the `wait_lock` [spinlock](https://en.wikipedia.org/wiki/Spinlock) for protection of this list. Notice that the `rw_semaphore.count` field has the `long` type, unlike the same field in the `semaphore` structure.
|
||||
If we take a look at the definition of the `rw_semaphore` structure, we will notice that the first three fields are the same as in the `semaphore` structure. It contains the `count` field which represents the amount of available resources, the `wait_list` field which represents a [doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-1.html) of processes which are waiting to acquire a lock and the `wait_lock` [spinlock](https://en.wikipedia.org/wiki/Spinlock) for protection of this list. Notice that the `rw_semaphore.count` field has the `long` type, unlike the same field in the `semaphore` structure.
|
||||
|
||||
The `count` field of a `rw_semaphore` structure may have following values:
|
||||
|
||||
@ -70,7 +70,7 @@ The `count` field of a `rw_semaphore` structure may have following values:
|
||||
* `0xffffffff00000000` - represents the situation when readers or writers are queued, but no one is active or in the process of acquiring a lock;
|
||||
* `0xfffffffe00000001` - a writer is active or attempting to acquire a lock and waiters are in queue.
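The two values listed above can be reproduced with a little arithmetic. The sketch below assumes the 64-bit bias constants as I recall them from the kernel's x86 `rwsem.h` of that era (an active mask of `0xffffffff`, an active bias of `1` and a waiting bias of `-mask - 1`). All `TOY_*` and `toy_*` names are hypothetical, and treating "old value was zero" as success is a simplification of the real fast path.

```C
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed 64-bit bias values; the names are illustrative only. */
#define TOY_ACTIVE_MASK		INT64_C(0xffffffff)
#define TOY_ACTIVE_BIAS		INT64_C(1)
#define TOY_WAITING_BIAS	(-TOY_ACTIVE_MASK - 1)
#define TOY_ACTIVE_WRITE_BIAS	(TOY_WAITING_BIAS + TOY_ACTIVE_BIAS)

/* Waiters queued and nobody active: only the waiting bias is set. */
static_assert((uint64_t)TOY_WAITING_BIAS == UINT64_C(0xffffffff00000000),
	      "waiting bias occupies the upper 32 bits");

/* View a signed count as the raw 64-bit pattern used in the list above. */
static uint64_t toy_count_bits(int64_t count)
{
	return (uint64_t)count;
}

/*
 * Model of a write-lock attempt: atomically add the write bias and look
 * at the previous value, which is what an `xadd` style operation gives us.
 * Simplification: the lock is treated as acquired only if it was unlocked.
 */
static bool toy_down_write_fast(_Atomic int64_t *count)
{
	int64_t old = atomic_fetch_add(count, TOY_ACTIVE_WRITE_BIAS);

	return old == 0;
}
```

Adding the write bias on top of an already queued waiter, `TOY_WAITING_BIAS + TOY_ACTIVE_WRITE_BIAS`, gives the bit pattern `0xfffffffe00000001` from the last list item.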
|
||||
|
||||
So, besides the `count` field, all of these fields are similar to the fields of the `semaphore` structure. The last three fields depend on two configuration options of the Linux kernel: `CONFIG_RWSEM_SPIN_ON_OWNER` and `CONFIG_DEBUG_LOCK_ALLOC`. The first two fields may be familiar to us from the declaration of the [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) structure in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-4.html). The first field, `osq`, represents the [MCS lock](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) spinner for `optimistic spinning` and the second represents the process which is the current owner of a lock.
|
||||
So, besides the `count` field, all of these fields are similar to the fields of the `semaphore` structure. The last three fields depend on two configuration options of the Linux kernel: `CONFIG_RWSEM_SPIN_ON_OWNER` and `CONFIG_DEBUG_LOCK_ALLOC`. The first two fields may be familiar to us from the declaration of the [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) structure in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-4.html). The first field, `osq`, represents the [MCS lock](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) spinner for `optimistic spinning` and the second represents the process which is the current owner of a lock.
|
||||
|
||||
The last field of the `rw_semaphore` structure, `dep_map`, is debugging related, and as I already wrote in previous parts, we will skip debugging related stuff in this chapter.
|
||||
|
||||
@ -193,7 +193,7 @@ void __sched down_write(struct rw_semaphore *sem)
|
||||
}
|
||||
```
|
||||
|
||||
We already met the `might_sleep` macro in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-4.html). In short, the implementation of the `might_sleep` macro depends on the `CONFIG_DEBUG_ATOMIC_SLEEP` kernel configuration option and, if this option is enabled, this macro just prints a stack trace if it was executed in [atomic](https://en.wikipedia.org/wiki/Linearizability) context. As this macro is mostly for debugging purposes, we will skip it and go ahead. Additionally we will skip the next macro from the `down_read` function - `rwsem_acquire`, which is related to the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) of the Linux kernel, because this is a topic for another part.
|
||||
We already met the `might_sleep` macro in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-4.html). In short, the implementation of the `might_sleep` macro depends on the `CONFIG_DEBUG_ATOMIC_SLEEP` kernel configuration option and, if this option is enabled, this macro just prints a stack trace if it was executed in [atomic](https://en.wikipedia.org/wiki/Linearizability) context. As this macro is mostly for debugging purposes, we will skip it and go ahead. Additionally we will skip the next macro from the `down_read` function - `rwsem_acquire`, which is related to the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) of the Linux kernel, because this is a topic for another part.
|
||||
|
||||
The only two things that remain in the `down_write` function are the call of the `LOCK_CONTENDED` macro which is defined in the [include/linux/lockdep.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/lockdep.h) header file and the setting of the owner of a lock with the `rwsem_set_owner` function, which sets the owner to the currently running process:
|
||||
|
||||
@ -240,7 +240,7 @@ static inline void __down_write_nested(struct rw_semaphore *sem, int subclass)
|
||||
}
|
||||
```
|
||||
|
||||
As with other synchronization primitives which we saw in this chapter, `lock/unlock` functions usually consist only of an [inline assembly](https://0xax.gitbooks.io/linux-insides/content/Theory/asm.html) statement. As we may see, the same is true for the `__down_write_nested` function. Let's try to understand what this function does. The first line of our assembly statement is just a comment, so let's skip it. The second line contains `LOCK_PREFIX`, which will be expanded to the [LOCK](http://x86.renejeschke.de/html/file_module_x86_id_159.html) instruction as we already know. The next [xadd](http://x86.renejeschke.de/html/file_module_x86_id_327.html) instruction executes `add` and `exchange` operations. In other words, the `xadd` instruction adds the value of the `RWSEM_ACTIVE_WRITE_BIAS`:
|
||||
As with other synchronization primitives which we saw in this chapter, `lock/unlock` functions usually consist only of an [inline assembly](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-3.html) statement. As we may see, the same is true for the `__down_write_nested` function. Let's try to understand what this function does. The first line of our assembly statement is just a comment, so let's skip it. The second line contains `LOCK_PREFIX`, which will be expanded to the [LOCK](http://x86.renejeschke.de/html/file_module_x86_id_159.html) instruction as we already know. The next [xadd](http://x86.renejeschke.de/html/file_module_x86_id_327.html) instruction executes `add` and `exchange` operations. In other words, the `xadd` instruction adds the value of the `RWSEM_ACTIVE_WRITE_BIAS`:
|
||||
|
||||
```C
|
||||
#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
|
||||
@ -292,7 +292,7 @@ if (rwsem_optimistic_spin(sem))
|
||||
return sem;
|
||||
```
|
||||
|
||||
We will skip the implementation of the `rwsem_optimistic_spin` function, as it is similar to the `mutex_optimistic_spin` function which we saw in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-4.html). In short, in the `rwsem_optimistic_spin` function we check whether there are other tasks ready to run that have higher priority. If there are such tasks, the process will be added to the [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) `waitqueue` and start to spin in a loop until the lock can be acquired. If `optimistic spinning` is disabled, the process will be added to the wait list and marked as waiting for write:
|
||||
We will skip the implementation of the `rwsem_optimistic_spin` function, as it is similar to the `mutex_optimistic_spin` function which we saw in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-4.html). In short, in the `rwsem_optimistic_spin` function we check whether there are other tasks ready to run that have higher priority. If there are such tasks, the process will be added to the [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) `waitqueue` and start to spin in a loop until the lock can be acquired. If `optimistic spinning` is disabled, the process will be added to the wait list and marked as waiting for write:
|
||||
|
||||
```C
|
||||
waiter.task = current;
|
||||
@ -325,7 +325,7 @@ while (true) {
|
||||
}
|
||||
```
|
||||
|
||||
I will skip the explanation of this loop as we already met similar functionality in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-4.html).
|
||||
I will skip the explanation of this loop as we already met similar functionality in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-4.html).
|
||||
|
||||
That's all. From this moment, our `writer` process will or will not acquire a lock depending on the value of the `rw_semaphore->count` field. Now if we look at the implementation of the `down_read` function, which tries to acquire a lock, we will see actions similar to those we saw in the `down_write` function. This function calls different debugging and lock validator related functions/macros:
|
||||
|
||||
@ -422,12 +422,12 @@ Links
|
||||
* [Semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29)
|
||||
* [Mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
|
||||
* [x86_64 architecture](https://en.wikipedia.org/wiki/X86-64)
|
||||
* [Doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html)
|
||||
* [Doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-1.html)
|
||||
* [MCS lock](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf)
|
||||
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
|
||||
* [Linux kernel lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
|
||||
* [Atomic operations](https://en.wikipedia.org/wiki/Linearizability)
|
||||
* [Inline assembly](https://0xax.gitbooks.io/linux-insides/content/Theory/asm.html)
|
||||
* [Inline assembly](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-3.html)
|
||||
* [XADD instruction](http://x86.renejeschke.de/html/file_module_x86_id_327.html)
|
||||
* [LOCK instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-4.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-4.html)
|
@ -6,7 +6,7 @@ Introduction
|
||||
|
||||
This is the sixth part of the chapter which describes [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_\(computer_science\)) in the Linux kernel. In the previous parts we finished considering different [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) synchronization primitives. We will continue to learn about synchronization primitives in this part and start to consider a similar synchronization primitive which can be used to avoid the `writer starvation` problem. The name of this synchronization primitive is `seqlock` or `sequential lock`.
|
||||
|
||||
We know from the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-5.html) that [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) is a special lock mechanism which allows concurrent access for read-only operations, but an exclusive lock is needed for writing or modifying data. As we may guess, it may lead to a problem which is called `writer starvation`. In other words, a writer process can't acquire a lock as long as at least one reader process which acquired a lock holds it. So, in the situation when contention is high, it will lead to situation when a writer process which wants to acquire a lock will wait for it for a long time.
|
||||
We know from the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-5.html) that [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) is a special lock mechanism which allows concurrent access for read-only operations, but an exclusive lock is needed for writing or modifying data. As we may guess, it may lead to a problem which is called `writer starvation`. In other words, a writer process can't acquire a lock as long as at least one reader process which acquired a lock holds it. So, in the situation when contention is high, it will lead to situation when a writer process which wants to acquire a lock will wait for it for a long time.
|
||||
|
||||
The `seqlock` synchronization primitive can help solve this problem.
|
||||
|
||||
@ -17,9 +17,9 @@ So, let's start.
|
||||
Sequential lock
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
So, what is a `seqlock` synchronization primitive and how does it work? Let's try to answer these questions in this paragraph. Actually `sequential locks` were introduced in the Linux kernel 2.6.x. The main point of this synchronization primitive is to provide fast and lock-free access to shared resources. Since the heart of the `sequential lock` synchronization primitive is the [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) synchronization primitive, `sequential locks` work in situations where the protected resources are small and simple. Additionally, write access must be rare and should also be fast.
|
||||
So, what is a `seqlock` synchronization primitive and how does it work? Let's try to answer these questions in this paragraph. Actually `sequential locks` were introduced in the Linux kernel 2.6.x. The main point of this synchronization primitive is to provide fast and lock-free access to shared resources. Since the heart of the `sequential lock` synchronization primitive is the [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) synchronization primitive, `sequential locks` work in situations where the protected resources are small and simple. Additionally, write access must be rare and should also be fast.
|
||||
|
||||
Work of this synchronization primitive is based on the sequence of events counter. Actually a `sequential lock` allows free access to a resource for readers, but each reader must check existence of conflicts with a writer. This synchronization primitive introduces a special counter. The main algorithm of work of `sequential locks` is simple: Each writer which acquired a sequential lock increments this counter and additionally acquires a [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html). When this writer finishes, it will release the acquired spinlock to give access to other writers and increment the counter of a sequential lock again.
|
||||
Work of this synchronization primitive is based on the sequence of events counter. Actually a `sequential lock` allows free access to a resource for readers, but each reader must check existence of conflicts with a writer. This synchronization primitive introduces a special counter. The main algorithm of work of `sequential locks` is simple: Each writer which acquired a sequential lock increments this counter and additionally acquires a [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html). When this writer finishes, it will release the acquired spinlock to give access to other writers and increment the counter of a sequential lock again.
|
||||
|
||||
Read only access works on the following principle, it gets the value of a `sequential lock` counter before it will enter into [critical section](https://en.wikipedia.org/wiki/Critical_section) and compares it with the value of the same `sequential lock` counter at the exit of critical section. If their values are equal, this means that there weren't writers for this period. If their values are not equal, this means that a writer has incremented the counter during the [critical section](https://en.wikipedia.org/wiki/Critical_section). This conflict means that reading of protected data must be repeated.
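The counter protocol described above can be sketched in userspace with C11 atomics. This is a simplified model under stated assumptions: the `toy_seqlock` type and the `toy_*` helpers are hypothetical, the spinlock that serializes writers is left out (a single writer is assumed), and the memory barriers a real implementation needs around the protected data are not modelled.

```C
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a sequential lock; writer-side spinlock omitted. */
struct toy_seqlock {
	atomic_uint sequence;	/* even: no writer inside, odd: write in progress */
	int a, b;		/* the small, simple data being protected */
};

/* Writer entry: the counter becomes odd. */
static void toy_write_begin(struct toy_seqlock *s)
{
	unsigned old = atomic_fetch_add(&s->sequence, 1);

	assert((old & 1u) == 0);	/* single writer assumed */
}

/* Writer exit: the counter becomes even again. */
static void toy_write_end(struct toy_seqlock *s)
{
	unsigned old = atomic_fetch_add(&s->sequence, 1);

	assert((old & 1u) == 1);
}

/* Reader entry: remember the counter value seen before reading. */
static unsigned toy_read_begin(struct toy_seqlock *s)
{
	return atomic_load(&s->sequence);
}

/*
 * Reader exit: the read must be repeated if a writer was already inside
 * when we started (odd start) or if the counter changed in the meantime.
 */
static bool toy_read_retry(struct toy_seqlock *s, unsigned start)
{
	return (start & 1u) || atomic_load(&s->sequence) != start;
}
```

A reader would typically wrap its accesses in a loop: call `toy_read_begin`, copy `a` and `b`, and repeat while `toy_read_retry` returns `true`.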
|
||||
|
||||
@ -54,7 +54,7 @@ typedef struct {
|
||||
} seqlock_t;
|
||||
```
|
||||
|
||||
As we may see, the `seqlock_t` provides two fields. These fields represent a sequential lock counter, the description of which we saw above, and also a [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) which will protect data from other writers. Note that the `seqcount` counter is represented by the `seqcount` type. The `seqcount` is a structure:
|
||||
As we may see, the `seqlock_t` provides two fields. These fields represent a sequential lock counter, the description of which we saw above, and also a [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) which will protect data from other writers. Note that the `seqcount` counter is represented by the `seqcount` type. The `seqcount` is a structure:
|
||||
|
||||
```C
|
||||
typedef struct seqcount {
|
||||
@ -114,7 +114,7 @@ So we just initialize counter of the given sequential lock to zero and additiona
|
||||
#endif
|
||||
```
|
||||
|
||||
As I already wrote in previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/) we will not consider [debugging](https://en.wikipedia.org/wiki/Debugging) and [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related stuff in this part. So for now we just skip the `SEQCOUNT_DEP_MAP_INIT` macro. The second field of the given `seqlock_t` is `lock`, initialized with the `__SPIN_LOCK_UNLOCKED` macro which is defined in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/spinlock_types.h) header file. We will not consider the implementation of this macro here as it just initializes a [rawspinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) with architecture-specific methods (you may read more about spinlocks in the first parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/)).
|
||||
As I already wrote in previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/) we will not consider [debugging](https://en.wikipedia.org/wiki/Debugging) and [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related stuff in this part. So for now we just skip the `SEQCOUNT_DEP_MAP_INIT` macro. The second field of the given `seqlock_t` is `lock`, initialized with the `__SPIN_LOCK_UNLOCKED` macro which is defined in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/spinlock_types.h) header file. We will not consider the implementation of this macro here as it just initializes a [rawspinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) with architecture-specific methods (you may read more about spinlocks in the first parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/)).
|
||||
|
||||
We have considered the first way to initialize a sequential lock. Let's consider second way to do the same, but do it dynamically. We can initialize a sequential lock with the `seqlock_init` macro which is defined in the same [include/linux/seqlock.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/seqlock.h) header file.
|
||||
|
||||
@ -149,7 +149,7 @@ static inline void __seqcount_init(seqcount_t *s, const char *name,
|
||||
}
|
||||
```
|
||||
|
||||
just initializes counter of the given `seqcount_t` with zero. The second call from the `seqlock_init` macro is the call of the `spin_lock_init` macro which we saw in the [first part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) of this chapter.
|
||||
just initializes counter of the given `seqcount_t` with zero. The second call from the `seqlock_init` macro is the call of the `spin_lock_init` macro which we saw in the [first part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) of this chapter.
|
||||
|
||||
So, now we know how to initialize a `sequential lock`, now let's look at how to use it. The Linux kernel provides following [API](https://en.wikipedia.org/wiki/Application_programming_interface) to manipulate `sequential locks`:
|
||||
|
||||
@ -223,7 +223,7 @@ static inline int __read_seqcount_retry(const seqcount_t *s, unsigned start)
|
||||
|
||||
which just compares value of the counter of the given `sequential lock` with the initial value of this counter. If the initial value of the counter which is obtained from `read_seqbegin()` function is odd, this means that a writer was in the middle of updating the data when our reader began to act. In this case the value of the data can be in inconsistent state, so we need to try to read it again.
|
||||
|
||||
This is a common pattern in the Linux kernel. For example, you may remember the `jiffies` concept from the [first part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) of the [timers and time management in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/) chapter. The sequential lock is used to obtain value of `jiffies` at [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture:
|
||||
This is a common pattern in the Linux kernel. For example, you may remember the `jiffies` concept from the [first part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html) of the [timers and time management in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/) chapter. The sequential lock is used to obtain value of `jiffies` at [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture:
|
||||
|
||||
```C
|
||||
u64 get_jiffies_64(void)
|
||||
@ -303,7 +303,7 @@ static inline void raw_write_seqcount_end(seqcount_t *s)
|
||||
|
||||
and in the end we just call the `spin_unlock` macro to give access for other readers or writers.
|
||||
|
||||
That's all about the `sequential lock` mechanism in the Linux kernel. Of course we did not consider the full [API](https://en.wikipedia.org/wiki/Application_programming_interface) of this mechanism in this part. But all other functions are based on those which we described here. For example, the Linux kernel also provides some safe macros/functions to use the `sequential lock` mechanism in [interrupt handlers](https://en.wikipedia.org/wiki/Interrupt_handler) of [softirq](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html): `write_seqlock_irq` and `write_sequnlock_irq`:
|
||||
That's all about the `sequential lock` mechanism in the Linux kernel. Of course we did not consider the full [API](https://en.wikipedia.org/wiki/Application_programming_interface) of this mechanism in this part. But all other functions are based on those which we described here. For example, the Linux kernel also provides some safe macros/functions to use the `sequential lock` mechanism in [interrupt handlers](https://en.wikipedia.org/wiki/Interrupt_handler) of [softirq](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-9.html): `write_seqlock_irq` and `write_sequnlock_irq`:
|
||||
|
||||
```C
|
||||
static inline void write_seqlock_irq(seqlock_t *sl)
|
||||
@ -339,7 +339,7 @@ Links
|
||||
|
||||
* [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_\(computer_science\))
|
||||
* [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock)
|
||||
* [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html)
|
||||
* [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html)
|
||||
* [critical section](https://en.wikipedia.org/wiki/Critical_section)
|
||||
* [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
|
||||
* [debugging](https://en.wikipedia.org/wiki/Debugging)
|
||||
@ -347,6 +347,6 @@ Links
|
||||
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
|
||||
* [Timers and time management in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/)
|
||||
* [interrupt handlers](https://en.wikipedia.org/wiki/Interrupt_handler)
|
||||
* [softirq](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html)
|
||||
* [softirq](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-9.html)
|
||||
* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_\(PC_architecture\))
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-5.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-5.html)
|
@ -2,9 +2,9 @@
|
||||
|
||||
This chapter describes the `system call` concept in the linux kernel.
|
||||
|
||||
* [Introduction to system call concept](syscall-1.md) - this part is introduction to the `system call` concept in the Linux kernel.
|
||||
* [How the Linux kernel handles a system call](syscall-2.md) - this part describes how the Linux kernel handles a system call from a userspace application.
|
||||
* [vsyscall and vDSO](syscall-3.md) - third part describes `vsyscall` and `vDSO` concepts.
|
||||
* [How the Linux kernel runs a program](syscall-4.md) - this part describes startup process of a program.
|
||||
* [Implementation of the open system call](syscall-5.md) - this part describes implementation of the [open](http://man7.org/linux/man-pages/man2/open.2.html) system call.
|
||||
* [Limits on resources in Linux](https://github.com/0xAX/linux-insides/blob/master/SysCall/syscall-6.md) - this part describes implementation of the [getrlimit/setrlimit](https://linux.die.net/man/2/getrlimit) system calls.
|
||||
* [Introduction to system call concept](linux-syscall-1.md) - this part is introduction to the `system call` concept in the Linux kernel.
|
||||
* [How the Linux kernel handles a system call](linux-syscall-2.md) - this part describes how the Linux kernel handles a system call from a userspace application.
|
||||
* [vsyscall and vDSO](linux-syscall-3.md) - third part describes `vsyscall` and `vDSO` concepts.
|
||||
* [How the Linux kernel runs a program](linux-syscall-4.md) - this part describes startup process of a program.
|
||||
* [Implementation of the open system call](linux-syscall-5.md) - this part describes implementation of the [open](http://man7.org/linux/man-pages/man2/open.2.html) system call.
|
||||
* [Limits on resources in Linux](linux-syscall-6.md) - this part describes implementation of the [getrlimit/setrlimit](https://linux.die.net/man/2/getrlimit) system calls.
|
||||
|
@ -4,7 +4,7 @@ System calls in the Linux kernel. Part 1.
|
||||
Introduction
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This post opens up a new chapter in [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book, and as you may understand from the title, this chapter will be devoted to the [System call](https://en.wikipedia.org/wiki/System_call) concept in the Linux kernel. The choice of topic for this chapter is not accidental. In the previous [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we saw interrupts and interrupt handling. The concept of system calls is very similar to that of interrupts. This is because the most common way to implement system calls is as software interrupts. We will see many different aspects that are related to the system call concept. For example, we will learn what's happening when a system call occurs from userspace. We will see an implementation of a couple system call handlers in the Linux kernel, [VDSO](https://en.wikipedia.org/wiki/VDSO) and [vsyscall](https://lwn.net/Articles/446528/) concepts and many many more.
|
||||
This post opens up a new chapter in [linux-insides](https://0xax.gitbooks.io/linux-insides/content/) book, and as you may understand from the title, this chapter will be devoted to the [System call](https://en.wikipedia.org/wiki/System_call) concept in the Linux kernel. The choice of topic for this chapter is not accidental. In the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) we saw interrupts and interrupt handling. The concept of system calls is very similar to that of interrupts. This is because the most common way to implement system calls is as software interrupts. We will see many different aspects that are related to the system call concept. For example, we will learn what's happening when a system call occurs from userspace. We will see an implementation of a couple system call handlers in the Linux kernel, [VDSO](https://en.wikipedia.org/wiki/VDSO) and [vsyscall](https://lwn.net/Articles/446528/) concepts and many many more.
|
||||
|
||||
Before we dive into Linux system call implementation, it is good to know some theory about system calls. Let's do it in the following paragraph.
|
||||
|
||||
@ -76,15 +76,19 @@ by those selector values correspond to the fixed values loaded into the descript
|
||||
caches; the SYSCALL instruction does not ensure this correspondence.
|
||||
```
|
||||
|
||||
and we are initializing `syscalls` by the writing of the `entry_SYSCALL_64` that defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/entry_64.S) assembler file and represents `SYSCALL` instruction entry to the `IA32_STAR` [Model specific register](https://en.wikipedia.org/wiki/Model-specific_register):
|
||||
|
||||
To summarize, the `syscall` instruction jumps to the address stored in the `MSR_LSTAR` [Model specific register](https://en.wikipedia.org/wiki/Model-specific_register) (Long system target address register). The kernel is responsible for providing its own custom function for handling syscalls as well as writing the address of this handler function to the `MSR_LSTAR` register upon system startup.
|
||||
The custom function is `entry_SYSCALL_64`, which is defined in [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/entry_64.S#L98). The address of this syscall handling function is written to the `MSR_LSTAR` register during startup in [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/cpu/common.c#L1335).
|
||||
```C
|
||||
wrmsrl(MSR_LSTAR, entry_SYSCALL_64);
|
||||
```
|
||||
|
||||
in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/cpu/common.c) source code file.
|
||||
So, the `syscall` instruction invokes a handler of a given system call. But how does it know which handler to call? Actually it gets this information from the general purpose [registers](https://en.wikipedia.org/wiki/Processor_register). As you can see in the system call [table](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/syscalls/syscall_64.tbl), each system call has a unique number. In our example the first system call is `write`, which writes data to the given file. Let's look in the system call table and try to find the `write` system call. As we can see, the [write](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/syscalls/syscall_64.tbl#L10) system call has number `1`. We pass the number of this system call through the `rax` register in our example. The next general purpose registers: `%rdi`, `%rsi`, and `%rdx` take the three parameters of the `write` syscall. In our case, they are:
|
||||
|
||||
So, the `syscall` instruction invokes a handler of a given system call. But how does it know which handler to call? Actually it gets this information from the general purpose [registers](https://en.wikipedia.org/wiki/Processor_register). As you can see in the system call [table](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/syscalls/syscall_64.tbl), each system call has a unique number. In our example, first system call is - `write` that writes data to the given file. Let's look in the system call table and try to find `write` system call. As we can see, the [write](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/syscalls/syscall_64.tbl#L10) system call has number - `1`. We pass the number of this system call through the `rax` register in our example. The next general purpose registers: `%rdi`, `%rsi` and `%rdx` take parameters of the `write` syscall. In our case, they are [file descriptor](https://en.wikipedia.org/wiki/File_descriptor) (`1` is [stdout](https://en.wikipedia.org/wiki/Standard_streams#Standard_output_.28stdout.29) in our case), second parameter is the pointer to our string, and the third is size of data. Yes, you heard right. Parameters for a system call. As I already wrote above, a system call is a just `C` function in the kernel space. In our case first system call is write. This system call defined in the [fs/read_write.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/fs/read_write.c) source code file and looks like:
|
||||
* [File descriptor](https://en.wikipedia.org/wiki/File_descriptor) (`1` is [stdout](https://en.wikipedia.org/wiki/Standard_streams#Standard_output_.28stdout.29) in our case)
|
||||
* Pointer to our string
|
||||
* Size of data
|
||||
|
||||
Yes, you heard right: parameters for a system call. As I already wrote above, a system call is just a `C` function in kernel space. In our case, the first system call is `write`. This system call is defined in the [fs/read_write.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/fs/read_write.c) source code file and looks like:
|
||||
|
||||
```C
|
||||
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
|
||||
@ -104,7 +108,7 @@ ssize_t write(int fd, const void *buf, size_t nbytes);
|
||||
|
||||
Don't worry about the `SYSCALL_DEFINE3` macro for now, we'll come back to it.
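The same `write` call can also be sketched from `C` without the library wrapper. The block below is only an illustration, not code from the kernel or from the assembly example: `raw_write` and `say_hello` are hypothetical helper names, the message text is made up, and the sketch assumes a Linux system where libc provides `syscall()` and `SYS_write`.

```C
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_write: the number of the write system call */
#include <unistd.h>        /* syscall() */

/*
 * Hypothetical helper: invokes the write system call by number,
 * bypassing the write() wrapper from the standard library.
 * Returns the number of bytes written, or -1 with errno set on failure.
 */
long raw_write(int fd, const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    /* Three parameters: file descriptor, pointer to data, size of data. */
    return syscall(SYS_write, (long)fd, buf, len);
}

/* Writes a greeting to stdout (file descriptor 1). */
long say_hello(void)
{
    static const char msg[] = "Hello, world!\n";
    return raw_write(1, msg, sizeof(msg) - 1);
}
```

Calling `say_hello()` should show up in `strace` as a single `write(1, ...)` call, the same way the assembly version does.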
|
||||
|
||||
The second part of our example is the same, but we call other system call. In this case we call [exit](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/syscalls/syscall_64.tbl#L69) system call. This system call gets only one parameter:
|
||||
The second part of our example is the same, but we call another system call. In this case we call the [exit](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/syscalls/syscall_64.tbl#L69) system call. This system call gets only one parameter:
|
||||
|
||||
* Return value
|
||||
|
||||
@ -120,18 +124,18 @@ _exit(0) = ?
|
||||
+++ exited with 0 +++
|
||||
```
|
||||
|
||||
In the first line of the `strace` output, we can see [execve](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/syscalls/syscall_64.tbl#L68) system call that executes our program, and the second and third are system calls that we have used in our program: `write` and `exit`. Note that we pass the parameter through the general purpose registers in our example. The order of the registers is not accidental. The order of the registers is defined by the following agreement - [x86-64 calling conventions](https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions). This and other agreement for the `x86_64` architecture explained in the special document - [System V Application Binary Interface. PDF](https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-r252.pdf). In a general way, argument(s) of a function are placed either in registers or pushed on the stack. The right order is:
|
||||
In the first line of the `strace` output, we can see the [execve](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/syscalls/syscall_64.tbl#L68) system call that executes our program, and the second and third are system calls that we have used in our program: `write` and `exit`. Note that we pass the parameter through the general purpose registers in our example. The order of the registers is not accidental. The order of the registers is defined by the following agreement - [x86-64 calling conventions](https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions). This, and the other agreement for the `x86_64` architecture are explained in the special document - [System V Application Binary Interface. PDF](https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-r252.pdf). In a general way, argument(s) of a function are placed either in registers or pushed on the stack. The right order is:
|
||||
|
||||
* `rdi`;
|
||||
* `rsi`;
|
||||
* `rdx`;
|
||||
* `rcx`;
|
||||
* `r8`;
|
||||
* `r9`.
|
||||
* `rdi`
|
||||
* `rsi`
|
||||
* `rdx`
|
||||
* `rcx`
|
||||
* `r8`
|
||||
* `r9`
|
||||
|
||||
for the first six parameters of a function. If a function has more than six arguments, other parameters will be placed on the stack.
|
||||
for the first six parameters of a function. If a function has more than six arguments, the remaining parameters will be placed on the stack.
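As a rough illustration of a call that uses all six argument slots, the sketch below invokes `mmap` by number through the libc `syscall()` function. `raw_mmap_anon` is a hypothetical helper, and the sketch assumes a 64-bit Linux target where `SYS_mmap` is defined. Note that at the `syscall` instruction level the Linux kernel expects the fourth argument in `r10` rather than `rcx`, because the instruction itself overwrites `rcx`; `syscall()` takes care of that detail.

```C
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>      /* PROT_*, MAP_*, MAP_FAILED */
#include <sys/syscall.h>   /* SYS_mmap */
#include <unistd.h>        /* syscall() */

/*
 * Hypothetical helper: maps `len` bytes of zeroed anonymous memory by
 * invoking the mmap system call with all six of its arguments.
 * Returns the address of the mapping, or MAP_FAILED on error.
 */
void *raw_mmap_anon(size_t len)
{
    assert(len > 0);
    long ret = syscall(SYS_mmap,
                       (long)0,                             /* 1st: address hint  */
                       (long)len,                           /* 2nd: length        */
                       (long)(PROT_READ | PROT_WRITE),      /* 3rd: protection    */
                       (long)(MAP_PRIVATE | MAP_ANONYMOUS), /* 4th: flags         */
                       (long)-1,                            /* 5th: no file       */
                       (long)0);                            /* 6th: file offset   */
    return ret == -1 ? MAP_FAILED : (void *)ret;
}
```

The returned region can be released with the ordinary `munmap` function.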
|
||||
|
||||
We do not use system calls in our code directly, but our program uses it when we want to print something, check access to a file or just write or read something to it.
|
||||
We do not use system calls in our code directly, but our program uses them when we want to print something, check access to a file or just write or read something to it.
|
||||
|
||||
For example:
|
||||
|
||||
@ -152,13 +156,13 @@ int main(int argc, char **argv)
|
||||
}
|
||||
```
|
||||
|
||||
There are no `fopen`, `fgets`, `printf` and `fclose` system calls in the Linux kernel, but `open`, `read` `write` and `close` instead. I think you know that these four functions `fopen`, `fgets`, `printf` and `fclose` are just functions that defined in the `C` [standard library](https://en.wikipedia.org/wiki/GNU_C_Library). Actually these functions are wrappers for the system calls. We do not call system calls directly in our code, but using [wrapper](https://en.wikipedia.org/wiki/Wrapper_function) functions from the standard library. The main reason of this is simple: a system call must be performed quickly, very quickly. As a system call must be quick, it must be small. The standard library takes responsibility to perform system calls with the correct set parameters and makes different checks before it will call the given system call. Let's compile our program with the following command:
|
||||
There are no `fopen`, `fgets`, `printf`, and `fclose` system calls in the Linux kernel, but `open`, `read`, `write`, and `close` instead. I think you know that `fopen`, `fgets`, `printf`, and `fclose` are defined in the `C` [standard library](https://en.wikipedia.org/wiki/GNU_C_Library). Actually, these functions are just wrappers for the system calls. We do not call system calls directly in our code, but instead use these [wrapper](https://en.wikipedia.org/wiki/Wrapper_function) functions from the standard library. The main reason of this is simple: a system call must be performed quickly, very quickly. As a system call must be quick, it must be small. The standard library takes responsibility to perform system calls with the correct parameters and makes different checks before it will call the given system call. Let's compile our program with the following command:
|
||||
|
||||
```
|
||||
$ gcc test.c -o test
|
||||
```
|
||||
|
||||
and look on it with the [ltrace](https://en.wikipedia.org/wiki/Ltrace) util:
|
||||
and examine it with the [ltrace](https://en.wikipedia.org/wiki/Ltrace) util:
|
||||
|
||||
```
|
||||
$ ltrace ./test
|
||||
@ -172,13 +176,13 @@ fclose(0x602010) = 0
|
||||
+++ exited (status 0) +++
|
||||
```
|
||||
|
||||
The `ltrace` util displays a set of userspace calls of a program. The `fopen` function opens the given text file, the `fgets` reads file content to the `buf` buffer, the `puts` function prints it to the `stdout` and the `fclose` function closes file by the given file descriptor. And as I already wrote, all of these functions call an appropriate system call. For example `puts` calls the `write` system call inside, we can see it if we will add `-S` option to the `ltrace` program:
|
||||
The `ltrace` util displays a set of userspace calls of a program. The `fopen` function opens the given text file, the `fgets` function reads file content to the `buf` buffer, the `puts` function prints the buffer to `stdout`, and the `fclose` function closes the file given by the file descriptor. And as I already wrote, all of these functions call an appropriate system call. For example, `puts` calls the `write` system call internally; we can see this if we add the `-S` option to the `ltrace` program:
|
||||
|
||||
```
|
||||
write@SYS(1, "Hello World!\n\n", 14) = 14
|
||||
```
|
||||
|
||||
Yes, system calls are ubiquitous. Each program needs to open/write/read file, network connection, allocate memory and many other things that can be provided only by the kernel. The [proc](https://en.wikipedia.org/wiki/Procfs) file system contains special files in a format: `/proc/pid/syscall` that exposes the system call number and argument registers for the system call currently being executed by the process. For example, pid 1, that is [systemd](https://en.wikipedia.org/wiki/Systemd) for me:
|
||||
Yes, system calls are ubiquitous. Each program needs to open/write/read files and network connections, allocate memory, and many other things that can be provided only by the kernel. The [proc](https://en.wikipedia.org/wiki/Procfs) file system contains special files in a format: `/proc/pid/syscall` that exposes the system call number and argument registers for the system call currently being executed by the process. For example, pid 1 is [systemd](https://en.wikipedia.org/wiki/Systemd) for me:
|
||||
|
||||
```
|
||||
$ sudo cat /proc/1/comm
|
||||
@ -412,4 +416,4 @@ Links
|
||||
* [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system)
|
||||
* [systemd](https://en.wikipedia.org/wiki/Systemd)
|
||||
* [epoll](https://en.wikipedia.org/wiki/Epoll)
|
||||
* [Previous chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html)
|
||||
* [Previous chapter](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html)
|
@ -4,7 +4,7 @@ System calls in the Linux kernel. Part 2.
|
||||
How does the Linux kernel handle a system call
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) was the first part of the chapter that describes the [system call](https://en.wikipedia.org/wiki/System_call) concepts in the Linux kernel.
|
||||
The previous [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-1.html) was the first part of the chapter that describes the [system call](https://en.wikipedia.org/wiki/System_call) concepts in the Linux kernel.
|
||||
In the previous part we learned what a system call is in the Linux kernel, and in operating systems in general. This was introduced from a user-space perspective, and part of the [write](http://man7.org/linux/man-pages/man2/write.2.html) system call implementation was discussed. In this part we continue our look at system calls, starting with some theory before moving onto the Linux kernel code.
|
||||
|
||||
A user application usually does not make a system call directly. We did not write the `Hello world!` program like:
|
||||
@ -114,7 +114,7 @@ asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
|
||||
|
||||
After this all elements that point to the non-implemented system calls will contain the address of the `sys_ni_syscall` function that just returns `-ENOSYS` as we saw above, and other elements will point to the `sys_syscall_name` functions.
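The idea of a table whose unused slots all point to a "not implemented" handler can be sketched in plain userspace `C`. Everything below is hypothetical and only mirrors the shape of the kernel table: the `demo_` names, the two-argument handler type, and the table size are invented for the example, and returning `-ENOSYS` for an out-of-range number is an assumption of this sketch.

```C
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_NR_MAX 3   /* highest valid call number in this toy table */

typedef long (*demo_call_ptr_t)(long, long);

/* Handler used for every number that has no real implementation. */
long demo_ni_call(long a, long b)
{
    (void)a;
    (void)b;
    return -ENOSYS;
}

/* The only "implemented" call in this sketch, registered as number 1. */
long demo_add(long a, long b)
{
    return a + b;
}

/* Every slot holds a valid function pointer, so dispatch never hits NULL. */
const demo_call_ptr_t demo_call_table[DEMO_NR_MAX + 1] = {
    demo_ni_call,   /* 0: not implemented */
    demo_add,       /* 1: implemented     */
    demo_ni_call,   /* 2: not implemented */
    demo_ni_call,   /* 3: not implemented */
};

/* Looks up the handler by number and calls it. */
long demo_dispatch(unsigned long nr, long a, long b)
{
    if (nr > DEMO_NR_MAX)
        return -ENOSYS;
    assert(demo_call_table[nr] != NULL);
    return demo_call_table[nr](a, b);
}
```

Because no slot is ever empty, the dispatcher needs only a bounds check before the indirect call.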
|
||||
|
||||
At this point, we have filled the system call table and the Linux kernel knows where each system call handler is. But the Linux kernel does not call a `sys_syscall_name` function immediately after it is instructed to handle a system call from a user space application. Remember the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about interrupts and interrupt handling. When the Linux kernel gets the control to handle an interrupt, it had to do some preparations like save user space registers, switch to a new stack and many more tasks before it will call an interrupt handler. There is the same situation with the system call handling. The preparation for handling a system call is the first thing, but before the Linux kernel will start these preparations, the entry point of a system call must be initialized and only the Linux kernel knows how to perform this preparation. In the next paragraph we will see the process of the initialization of the system call entry in the Linux kernel.
|
||||
At this point, we have filled the system call table and the Linux kernel knows where each system call handler is. But the Linux kernel does not call a `sys_syscall_name` function immediately after it is instructed to handle a system call from a user space application. Remember the [chapter](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) about interrupts and interrupt handling. When the Linux kernel gets the control to handle an interrupt, it had to do some preparations like save user space registers, switch to a new stack and many more tasks before it will call an interrupt handler. There is the same situation with the system call handling. The preparation for handling a system call is the first thing, but before the Linux kernel will start these preparations, the entry point of a system call must be initialized and only the Linux kernel knows how to perform this preparation. In the next paragraph we will see the process of the initialization of the system call entry in the Linux kernel.
|
||||
|
||||
Initialization of the system call entry
|
||||
--------------------------------------------------------------------------------
|
||||
@ -126,7 +126,7 @@ SYSCALL invokes an OS system-call handler at privilege level 0.
|
||||
It does so by loading RIP from the IA32_LSTAR MSR
|
||||
```
|
||||
|
||||
it means that we need to put the system call entry in to the `IA32_LSTAR` [model specific register](https://en.wikipedia.org/wiki/Model-specific_register). This operation takes place during the Linux kernel initialization process. If you have read the fourth [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-4.html) of the chapter that describes interrupts and interrupt handling in the Linux kernel, you know that the Linux kernel calls the `trap_init` function during the initialization process. This function is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c) source code file and executes the initialization of the `non-early` exception handlers like divide error, [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) error etc. Besides the initialization of the `non-early` exceptions handlers, this function calls the `cpu_init` function from the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/blob/arch/x86/kernel/cpu/common.c) source code file which besides initialization of `per-cpu` state, calls the `syscall_init` function from the same source code file.
|
||||
it means that we need to put the system call entry in to the `IA32_LSTAR` [model specific register](https://en.wikipedia.org/wiki/Model-specific_register). This operation takes place during the Linux kernel initialization process. If you have read the fourth [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-4.html) of the chapter that describes interrupts and interrupt handling in the Linux kernel, you know that the Linux kernel calls the `trap_init` function during the initialization process. This function is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c) source code file and executes the initialization of the `non-early` exception handlers like divide error, [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) error etc. Besides the initialization of the `non-early` exceptions handlers, this function calls the `cpu_init` function from the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/blob/arch/x86/kernel/cpu/common.c) source code file which besides initialization of `per-cpu` state, calls the `syscall_init` function from the same source code file.
|
||||
|
||||
This function performs the initialization of the system call entry point. Let's look at the implementation of this function. It does not take parameters and first of all it fills two model specific registers:
|
||||
|
||||
@ -181,7 +181,7 @@ wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
|
||||
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
|
||||
```
|
||||
|
||||
You can read more about the `Global Descriptor Table` in the second [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) of the chapter that describes the booting process of the Linux kernel.
|
||||
You can read more about the `Global Descriptor Table` in the second [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) of the chapter that describes the booting process of the Linux kernel.
|
||||
|
||||
At the end of the `syscall_init` function, we just mask flags in the [flags register](https://en.wikipedia.org/wiki/FLAGS_register) by writing the set of flags to the `MSR_SYSCALL_MASK` model specific register:
|
||||
|
||||
@ -210,7 +210,7 @@ This macro is defined in the [arch/x86/include/asm/irqflags.h](https://github.co
|
||||
#define SWAPGS_UNSAFE_STACK swapgs
|
||||
```
|
||||
|
||||
which exchanges the current GS base register value with the value contained in the `MSR_KERNEL_GS_BASE ` model specific register. In other words we moved it on to the kernel stack. After this we point the old stack pointer to the `rsp_scratch` [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable and setup the stack pointer to point to the top of stack for the current processor:
|
||||
which exchanges the current GS base register value with the value contained in the `MSR_KERNEL_GS_BASE` model specific register. In other words we moved it on to the kernel stack. After this we save the old stack pointer in the `rsp_scratch` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variable and set up the stack pointer to point to the top of the stack for the current processor:
|
||||
|
||||
```assembly
|
||||
movq %rsp, PER_CPU_VAR(rsp_scratch)
|
||||
@ -378,7 +378,7 @@ That's all.
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the second part about the system calls concept in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) we saw theory about this concept from the user application view. In this part we continued to dive into the stuff which is related to the system call concept and saw what the Linux kernel does when a system call occurs.
|
||||
This is the end of the second part about the system calls concept in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-1.html) we saw theory about this concept from the user application view. In this part we continued to dive into the stuff which is related to the system call concept and saw what the Linux kernel does when a system call occurs.
|
||||
|
||||
If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com), or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).
|
||||
|
||||
@ -402,8 +402,8 @@ Links
|
||||
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
|
||||
* [flags register](https://en.wikipedia.org/wiki/FLAGS_register)
|
||||
* [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table)
|
||||
* [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
|
||||
* [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html)
|
||||
* [general purpose registers](https://en.wikipedia.org/wiki/Processor_register)
|
||||
* [ABI](https://en.wikipedia.org/wiki/Application_binary_interface)
|
||||
* [x86_64 C ABI](http://www.x86-64.org/documentation/abi.pdf)
|
||||
* [previous chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html)
|
||||
* [previous chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-1.html)
|
@ -4,7 +4,7 @@ System calls in the Linux kernel. Part 3.
|
||||
vsyscalls and vDSO
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the third part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) that describes system calls in the Linux kernel and we saw preparations after a system call caused by a userspace application and process of handling of a system call in the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html). In this part we will look at two concepts that are very close to the system call concept, they are called `vsyscall` and `vdso`.
|
||||
This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) that describes system calls in the Linux kernel and we saw preparations after a system call caused by a userspace application and process of handling of a system call in the previous [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-2.html). In this part we will look at two concepts that are very close to the system call concept, they are called `vsyscall` and `vdso`.
|
||||
|
||||
We already know what `system call`s are. They are special routines in the Linux kernel which userspace applications ask to do privileged tasks, like to read or to write to a file, to open a socket, etc. As you may know, invoking a system call is an expensive operation in the Linux kernel, because the processor must interrupt the currently executing task and switch context to kernel mode, subsequently jumping again into userspace after the system call handler finishes its work. These two mechanisms - `vsyscall` and `vdso` are designed to speed up this process for certain system calls and in this part we will try to understand how these mechanisms work.
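To make the difference a little more concrete, here is a small sketch that reads the monotonic clock in two ways. It assumes a 64-bit Linux system; the helper names are hypothetical. Whether the libc path is actually served by the `vDSO` depends on the platform and the libc, so treat that as an assumption. The second helper always enters the kernel because it issues the system call by number.

```C
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_clock_gettime */
#include <time.h>          /* clock_gettime(), struct timespec */
#include <unistd.h>        /* syscall() */

/* libc path: on many systems this is resolved through the vDSO. */
int now_via_libc(struct timespec *ts)
{
    assert(ts != NULL);
    return clock_gettime(CLOCK_MONOTONIC, ts);
}

/* Raw path: always performs a real system call. */
int now_via_syscall(struct timespec *ts)
{
    assert(ts != NULL);
    return (int)syscall(SYS_clock_gettime, (long)CLOCK_MONOTONIC, ts);
}
```

Both helpers fill the same structure, so a loop around each one is a simple way to compare their cost on a given machine.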
|
||||
|
||||
@ -24,7 +24,7 @@ or:
|
||||
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
|
||||
```
|
||||
|
||||
After this, these system calls will be executed in userspace and this means that there will not be [context switching](https://en.wikipedia.org/wiki/Context_switch). Mapping of the `vsyscall` page occurs in the `map_vsyscall` function that is defined in the [arch/x86/entry/vsyscall/vsyscall_64.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/vsyscall/vsyscall_64.c) source code file. This function is called during the Linux kernel initialization in the `setup_arch` function that is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c) source code file (we saw this function in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) of the Linux kernel initialization process chapter).
|
||||
After this, these system calls will be executed in userspace and this means that there will not be [context switching](https://en.wikipedia.org/wiki/Context_switch). Mapping of the `vsyscall` page occurs in the `map_vsyscall` function that is defined in the [arch/x86/entry/vsyscall/vsyscall_64.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/vsyscall/vsyscall_64.c) source code file. This function is called during the Linux kernel initialization in the `setup_arch` function that is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c) source code file (we saw this function in the fifth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) of the Linux kernel initialization process chapter).
|
||||
|
||||
Note that implementation of the `map_vsyscall` function depends on the `CONFIG_X86_VSYSCALL_EMULATION` kernel configuration option:
|
||||
|
||||
@ -49,7 +49,7 @@ void __init map_vsyscall(void)
|
||||
}
|
||||
```
|
||||
|
||||
As we can see, at the beginning of the `map_vsyscall` function we get the physical address of the `vsyscall` page with the `__pa_symbol` macro (we already saw the implementation of this macro in the fourth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process). The `__vsyscall_page` symbol is defined in the [arch/x86/entry/vsyscall/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/vsyscall/vsyscall_emu_64.S) assembly source code file and has the following [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space):
|
||||
As we can see, at the beginning of the `map_vsyscall` function we get the physical address of the `vsyscall` page with the `__pa_symbol` macro (we already saw the implementation of this macro in the fourth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process). The `__vsyscall_page` symbol is defined in the [arch/x86/entry/vsyscall/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/vsyscall/vsyscall_emu_64.S) assembly source code file and has the following [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space):
|
||||
|
||||
```
|
||||
ffffffff81881000 D __vsyscall_page
|
||||
@ -80,7 +80,7 @@ __vsyscall_page:
|
||||
ret
|
||||
```
|
||||
|
||||
Let's go back to the implementation of the `map_vsyscall` function and return to the implementation of the `__vsyscall_page` later. After we received the physical address of the `__vsyscall_page`, we check the value of the `vsyscall_mode` variable and set the [fix-mapped](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) address for the `vsyscall` page with the `__set_fixmap` macro:
|
||||
Let's go back to the implementation of the `map_vsyscall` function and return to the implementation of the `__vsyscall_page` later. After we received the physical address of the `__vsyscall_page`, we check the value of the `vsyscall_mode` variable and set the [fix-mapped](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-2.html) address for the `vsyscall` page with the `__set_fixmap` macro:
|
||||
|
||||
```C
|
||||
if (vsyscall_mode != NONE)
|
||||
@ -140,9 +140,9 @@ That will be called during early kernel parameters parsing:
|
||||
early_param("vsyscall", vsyscall_setup);
|
||||
```
|
||||
|
||||
More about `early_param` macro you can read in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the chapter that describes process of the initialization of the Linux kernel.
|
||||
More about `early_param` macro you can read in the sixth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the chapter that describes process of the initialization of the Linux kernel.
|
||||
|
||||
At the end of the `map_vsyscall` function we just check that the virtual address of the `vsyscall` page is equal to the value of `VSYSCALL_ADDR` with the [BUILD_BUG_ON](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) macro:
|
||||
At the end of the `map_vsyscall` function we just check that the virtual address of the `vsyscall` page is equal to the value of `VSYSCALL_ADDR` with the [BUILD_BUG_ON](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) macro:
|
||||
|
||||
```C
|
||||
BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
|
||||
@ -252,7 +252,7 @@ Here we can see that [uname](https://en.wikipedia.org/wiki/Uname) util was linke
|
||||
* `libc.so.6`;
|
||||
* `ld-linux-x86-64.so.2`.
|
||||
|
||||
The first provides `vDSO` functionality, the second is `C` [standard library](https://en.wikipedia.org/wiki/C_standard_library) and the third is the program interpreter (more about this you can read in the part that describes [linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)). So, the `vDSO` solves limitations of the `vsyscall`. Implementation of the `vDSO` is similar to `vsyscall`.
|
||||
The first provides `vDSO` functionality, the second is `C` [standard library](https://en.wikipedia.org/wiki/C_standard_library) and the third is the program interpreter (more about this you can read in the part that describes [linkers](https://0xax.gitbooks.io/linux-insides/content/Misc/linux-misc-3.html)). So, the `vDSO` solves limitations of the `vsyscall`. Implementation of the `vDSO` is similar to `vsyscall`.
|
||||
|
||||
Initialization of the `vDSO` occurs in the `init_vdso` function that is defined in the [arch/x86/entry/vdso/vma.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/vdso/vma.c) source code file. This function starts with the initialization of the `vDSO` images for 32-bit and 64-bit, depending on the `CONFIG_X86_X32_ABI` kernel configuration option:
|
||||
|
||||
@ -370,11 +370,11 @@ That's all.
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the third part about the system calls concept in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) we discussed the implementation of the preparation from the Linux kernel side, before a system call will be handled and implementation of the `exit` process from a system call handler. In this part we continued to dive into the stuff which is related to the system call concept and learned two new concepts that are very similar to the system call - the `vsyscall` and the `vDSO`.
|
||||
This is the end of the third part about the system calls concept in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-2.html) we discussed the implementation of the preparation from the Linux kernel side, before a system call will be handled and implementation of the `exit` process from a system call handler. In this part we continued to dive into the stuff which is related to the system call concept and learned two new concepts that are very similar to the system call - the `vsyscall` and the `vDSO`.
|
||||
|
||||
After these three parts, we know almost everything related to system calls: what a system call is and why user applications need them. We also know what occurs when a user application calls a system call and how the kernel handles system calls.
|
||||
|
||||
The next part will be the last part in this [chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) and we will see what occurs when a user runs the program.
|
||||
The next part will be the last part in this [chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) and we will see what occurs when a user runs the program.
|
||||
|
||||
If you have questions or suggestions, feel free to ping me on Twitter [0xAX](https://twitter.com/0xAX), drop me an [email](mailto:anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).
|
||||
|
||||
@ -390,14 +390,14 @@ Links
|
||||
* [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space)
|
||||
* [Segmentation](https://en.wikipedia.org/wiki/Memory_segmentation)
|
||||
* [enum](https://en.wikipedia.org/wiki/Enumerated_type)
|
||||
* [fix-mapped addresses](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)
|
||||
* [fix-mapped addresses](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-2.html)
|
||||
* [glibc](https://en.wikipedia.org/wiki/GNU_C_Library)
|
||||
* [BUILD_BUG_ON](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html)
|
||||
* [BUILD_BUG_ON](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html)
|
||||
* [Processor register](https://en.wikipedia.org/wiki/Processor_register)
|
||||
* [Page fault](https://en.wikipedia.org/wiki/Page_fault)
|
||||
* [segmentation fault](https://en.wikipedia.org/wiki/Segmentation_fault)
|
||||
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
|
||||
* [stack pointer](https://en.wikipedia.org/wiki/Stack_register)
|
||||
* [uname](https://en.wikipedia.org/wiki/Uname)
|
||||
* [Linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html)
|
||||
* [Linkers](https://0xax.gitbooks.io/linux-insides/content/Misc/linux-misc-3.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-2.html)
|
@ -4,7 +4,7 @@ System calls in the Linux kernel. Part 4.
|
||||
How does the Linux kernel run a program
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the fourth part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) that describes [system calls](https://en.wikipedia.org/wiki/System_call) in the Linux kernel and as I wrote in the conclusion of the [previous](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html) - this part will be last in this chapter. In the previous part we stopped at the two new concepts:
|
||||
This is the fourth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) that describes [system calls](https://en.wikipedia.org/wiki/System_call) in the Linux kernel and as I wrote in the conclusion of the [previous](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-3.html) - this part will be last in this chapter. In the previous part we stopped at the two new concepts:
|
||||
|
||||
* `vsyscall`;
|
||||
* `vDSO`;
|
||||
@ -73,7 +73,7 @@ So, a user application (`bash` in our case) calls the system call and as we alre
|
||||
execve system call
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
We saw preparation before a system call called by a user application and after a system call handler finished its work in the second [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) of this chapter. We stopped at the call of the `execve` system call in the previous paragraph. This system call defined in the [fs/exec.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/fs/exec.c) source code file and as we already know it takes three arguments:
|
||||
We saw preparation before a system call called by a user application and after a system call handler finished its work in the second [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-2.html) of this chapter. We stopped at the call of the `execve` system call in the previous paragraph. This system call defined in the [fs/exec.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/fs/exec.c) source code file and as we already know it takes three arguments:
|
||||
|
||||
```
|
||||
SYSCALL_DEFINE3(execve,
|
||||
@ -334,7 +334,7 @@ if (!elf_phdata)
|
||||
goto out;
|
||||
```
|
||||
|
||||
that describes [segments](https://en.wikipedia.org/wiki/Memory_segmentation). Read the `program interpreter` and the libraries that are linked with our executable binary file from disk and load them into memory. The `program interpreter` is specified in the `.interp` section of the executable file and, as you can read in the part that describes [Linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html), it is `/lib64/ld-linux-x86-64.so.2` for `x86_64`. It sets up the stack and maps the `elf` binary into the correct location in memory. It maps the [bss](https://en.wikipedia.org/wiki/.bss) and the [brk](http://man7.org/linux/man-pages/man2/sbrk.2.html) sections and does many other things to prepare the executable file for execution.
|
||||
that describes [segments](https://en.wikipedia.org/wiki/Memory_segmentation). Read the `program interpreter` and the libraries that are linked with our executable binary file from disk and load them into memory. The `program interpreter` is specified in the `.interp` section of the executable file and, as you can read in the part that describes [Linkers](https://0xax.gitbooks.io/linux-insides/content/Misc/linux-misc-3.html), it is `/lib64/ld-linux-x86-64.so.2` for `x86_64`. It sets up the stack and maps the `elf` binary into the correct location in memory. It maps the [bss](https://en.wikipedia.org/wiki/.bss) and the [brk](http://man7.org/linux/man-pages/man2/sbrk.2.html) sections and does many other things to prepare the executable file for execution.
|
||||
|
||||
In the end of the execution of the `load_elf_binary` we call the `start_thread` function and pass three arguments to it:
|
||||
|
||||
@ -424,7 +424,7 @@ Links
|
||||
* [Alpha](https://en.wikipedia.org/wiki/DEC_Alpha)
|
||||
* [FDPIC](http://elinux.org/UClinux_Shared_Library#FDPIC_ELF)
|
||||
* [segments](https://en.wikipedia.org/wiki/Memory_segmentation)
|
||||
* [Linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)
|
||||
* [Linkers](https://0xax.gitbooks.io/linux-insides/content/Misc/linux-misc-3.html)
|
||||
* [Processor register](https://en.wikipedia.org/wiki/Processor_register)
|
||||
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-3.html)
|
@ -49,7 +49,7 @@ So let's start.
|
||||
Definition of the open system call
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
If you have read the [fourth part](https://github.com/0xAX/linux-insides/blob/master/SysCall/syscall-4.md) of the [linux-insides](https://0xax.gitbooks.io/linux-insides/content/index.html) book, you should know that system calls are defined with the help of the `SYSCALL_DEFINE` macro. The `open` system call is no exception.
|
||||
If you have read the [fourth part](https://github.com/0xAX/linux-insides/blob/master/SysCall/linux-syscall-4.md) of the [linux-insides](https://0xax.gitbooks.io/linux-insides/content/index.html) book, you should know that system calls are defined with the help of the `SYSCALL_DEFINE` macro. The `open` system call is no exception.
|
||||
|
||||
The definition of the `open` system call is located in the [fs/open.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/fs/open.c) source code file and looks pretty small at first glance:
|
||||
|
||||
@ -400,4 +400,4 @@ Links
|
||||
* [inode](https://en.wikipedia.org/wiki/Inode)
|
||||
* [RCU](https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt)
|
||||
* [read](http://man7.org/linux/man-pages/man2/read.2.html)
|
||||
* [previous part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html)
|
||||
* [previous part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-4.html)
|
@ -2,6 +2,6 @@
|
||||
|
||||
This chapter describes various theoretical concepts and concepts which are not directly related to practice but useful to know.
|
||||
|
||||
* [Paging](Paging.md)
|
||||
* [Elf64 format](ELF.md)
|
||||
* [Inline assembly](asm.md)
|
||||
* [Paging](linux-theory-1.md)
|
||||
* [Elf64 format](linux-theory-2.md)
|
||||
* [Inline assembly](linux-theory-3.md)
|
||||
|
@ -4,7 +4,7 @@ Paging
|
||||
Introduction
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
In the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) of the series `Linux kernel booting process` we learned about what the kernel does in its earliest stage. In the next step the kernel will initialize different things like `initrd` mounting, lockdep initialization, and many many other things, before we can see how the kernel runs the first init process.
|
||||
In the fifth [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) of the series `Linux kernel booting process` we learned about what the kernel does in its earliest stage. In the next step the kernel will initialize different things like `initrd` mounting, lockdep initialization, and many many other things, before we can see how the kernel runs the first init process.
|
||||
|
||||
Yeah, there will be many different things, but many, many, and once again many of them work with **memory**.
|
||||
|
||||
@ -259,4 +259,4 @@ Links
|
||||
* [MMU](http://en.wikipedia.org/wiki/Memory_management_unit)
|
||||
* [ELF64](https://github.com/0xAX/linux-insides/blob/master/Theory/ELF.md)
|
||||
* [Documentation/x86/x86_64/mm.txt](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/x86_64/mm.txt)
|
||||
* [Last part - Kernel booting process](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html)
|
||||
* [Last part - Kernel booting process](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html)
|
@ -73,7 +73,7 @@ Here we see the `native_load_gdt` function which loads a base address from the [
|
||||
The second optional `qualifier` is the `goto`. This qualifier tells the compiler that the given assembly statement may perform a jump to one of the labels which are listed in the `GotoLabels`. For example:
|
||||
|
||||
```C
|
||||
__asm__ goto("jmp %l[label]" : : : label);
|
||||
__asm__ goto("jmp %l[label]" : : : : label);
|
||||
```
|
||||
|
||||
Since we finished with these two qualifiers, let's look at the main part of an assembly statement body. As we have seen above, the main part of an assembly statement consists of the following four parts:
|
||||
@ -190,7 +190,7 @@ If we look at the assembly output:
|
||||
400525: 48 01 d0 add %rdx,%rax
|
||||
```
|
||||
|
||||
we will see that the `%rdx` register is overwritten with `0x64` or `100` and the result will be `115` instead of `15`. Now if we add the `%rdx` register to the list of `clobbered` registers:
|
||||
we will see that the `%rdx` register is overwritten with `0x64` or `100` and the result will be `110` instead of `10`. Now if we add the `%rdx` register to the list of `clobbered` registers:
|
||||
|
||||
```C
|
||||
__asm__("movq $100, %%rdx\t\n"
|
||||
@ -406,25 +406,29 @@ The result, as expected:
|
||||
All of these constraints may be combined (so long as they do not conflict). In this case the compiler will choose the best one for a certain situation. For example:
|
||||
|
||||
```C
|
||||
#include <stdio.h>
|
||||
unsigned long a = 10;
|
||||
unsigned long b = 20;
|
||||
|
||||
unsigned long a = 1;
|
||||
|
||||
int main(void)
|
||||
void main(void)
|
||||
{
|
||||
unsigned long b;
|
||||
__asm__ ("movq %1,%0" : "=r"(b) : "r"(a));
|
||||
return b;
|
||||
__asm__ ("movq %1,%0" : "=mr"(b) : "rm"(a));
|
||||
}
|
||||
```
|
||||
|
||||
will use a memory operand:
|
||||
|
||||
```assembly
|
||||
0000000000400400 <main>:
|
||||
4004aa: 48 8b 05 6f 0b 20 00 mov 0x200b6f(%rip),%rax # 601020 <a>
|
||||
main:
|
||||
movq a(%rip),b(%rip)
|
||||
ret
|
||||
b:
|
||||
.quad 20
|
||||
a:
|
||||
.quad 10
|
||||
```
|
||||
|
||||
instead of direct usage of general purpose registers.
|
||||
|
||||
That's about all of the commonly used constraints in inline assembly statements. You can find more in the official [documentation](https://gcc.gnu.org/onlinedocs/gcc/Simple-Constraints.html#Simple-Constraints).
|
||||
|
||||
Architecture specific constraints
|
@ -2,10 +2,10 @@
|
||||
|
||||
This chapter describes timers and time management related concepts in the linux kernel.
|
||||
|
||||
* [Introduction](timers-1.md) - An introduction to the timers in the Linux kernel.
|
||||
* [Introduction to the clocksource framework](timers-2.md) - Describes `clocksource` framework in the Linux kernel.
|
||||
* [The tick broadcast framework and dyntick](timers-3.md) - Describes tick broadcast framework and dyntick concept.
|
||||
* [Introduction to timers](timers-4.md) - Describes timers in the Linux kernel.
|
||||
* [Introduction to the clockevents framework](timers-5.md) - Describes yet another clock/time management related framework : `clockevents`.
|
||||
* [x86 related clock sources](timers-6.md) - Describes `x86_64` related clock sources.
|
||||
* [Time related system calls in the Linux kernel](timers-7.md) - Describes time related system calls.
|
||||
* [Introduction](linux-timers-1.md) - An introduction to the timers in the Linux kernel.
|
||||
* [Introduction to the clocksource framework](linux-timers-2.md) - Describes `clocksource` framework in the Linux kernel.
|
||||
* [The tick broadcast framework and dyntick](linux-timers-3.md) - Describes tick broadcast framework and dyntick concept.
|
||||
* [Introduction to timers](linux-timers-4.md) - Describes timers in the Linux kernel.
|
||||
* [Introduction to the clockevents framework](linux-timers-5.md) - Describes yet another clock/time management related framework : `clockevents`.
|
||||
* [x86 related clock sources](linux-timers-6.md) - Describes `x86_64` related clock sources.
|
||||
* [Time related system calls in the Linux kernel](linux-timers-7.md) - Describes time related system calls.
|
||||
|
@ -4,7 +4,7 @@ Timers and time management in the Linux kernel. Part 1.
|
||||
Introduction
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is yet another post that opens a new chapter in the [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book. The previous [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html) described [system call](https://en.wikipedia.org/wiki/System_call) concepts, and now it's time to start new chapter. As one might understand from the title, this chapter will be devoted to the `timers` and `time management` in the Linux kernel. The choice of topic for the current chapter is not accidental. Timers (and generally, time management) are very important and widely used in the Linux kernel. The Linux kernel uses timers for various tasks, for example different timeouts in the [TCP](https://en.wikipedia.org/wiki/Transmission_Control_Protocol) implementation, the kernel knowing current time, scheduling asynchronous functions, next event interrupt scheduling and many many more.
|
||||
This is yet another post that opens a new chapter in the [linux-insides](https://0xax.gitbooks.io/linux-insides/content/) book. The previous [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-4.html) described [system call](https://en.wikipedia.org/wiki/System_call) concepts, and now it's time to start new chapter. As one might understand from the title, this chapter will be devoted to the `timers` and `time management` in the Linux kernel. The choice of topic for the current chapter is not accidental. Timers (and generally, time management) are very important and widely used in the Linux kernel. The Linux kernel uses timers for various tasks, for example different timeouts in the [TCP](https://en.wikipedia.org/wiki/Transmission_Control_Protocol) implementation, the kernel knowing current time, scheduling asynchronous functions, next event interrupt scheduling and many many more.
|
||||
|
||||
So, we will start to learn implementation of the different time management related stuff in this part. We will see different types of timers and how different Linux kernel subsystems use them. As always, we will start from the earliest part of the Linux kernel and go through the initialization process of the Linux kernel. We already did it in the special [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) which describes the initialization process of the Linux kernel, but as you may remember we missed some things there. And one of them is the initialization of timers.
|
||||
|
@ -4,7 +4,7 @@ Timers and time management in the Linux kernel. Part 2.
|
||||
Introduction to the `clocksource` framework
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) was the first part in the current [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and time management related stuff in the Linux kernel. We got acquainted with two concepts in the previous part:
|
||||
The previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html) was the first part in the current [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and time management related stuff in the Linux kernel. We got acquainted with two concepts in the previous part:
|
||||
|
||||
* `jiffies`
|
||||
* `clocksource`
|
||||
@ -92,7 +92,7 @@ Within this framework, each clock source is required to maintain a representatio
|
||||
The clocksource structure
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The foundation of the `clocksource` framework is the `clocksource` structure that is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/clocksource.h) header file. We already saw some fields that are provided by the `clocksource` structure in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html). Let's look at the full definition of this structure and try to describe all of its fields:
|
||||
The foundation of the `clocksource` framework is the `clocksource` structure that is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/clocksource.h) header file. We already saw some fields that are provided by the `clocksource` structure in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html). Let's look at the full definition of this structure and try to describe all of its fields:
|
||||
|
||||
```C
|
||||
struct clocksource {
|
||||
@ -197,7 +197,7 @@ That's all. From this moment we know all fields of the `clocksource` structure.
|
||||
New clock source registration
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
We saw only one function from the `clocksource` framework in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html). This function was `__clocksource_register`. It is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/tree/master/include/linux/clocksource.h) header file and, as we can understand from its name, its main purpose is to register a new clocksource. If we look at the implementation of the `__clocksource_register` function, we will see that it just calls the `__clocksource_register_scale` function and returns its result:
|
||||
We saw only one function from the `clocksource` framework in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html). This function was `__clocksource_register`. It is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/tree/master/include/linux/clocksource.h) header file and, as we can understand from its name, its main purpose is to register a new clocksource. If we look at the implementation of the `__clocksource_register` function, we will see that it just calls the `__clocksource_register_scale` function and returns its result:
|
||||
|
||||
```C
|
||||
static inline int __clocksource_register(struct clocksource *cs)
|
||||
@ -241,7 +241,7 @@ int __clocksource_register_scale(struct clocksource *cs, u32 scale, u32 freq)
|
||||
}
|
||||
```
|
||||
|
||||
First of all, we can see that the `__clocksource_register_scale` function starts with a call to the `__clocksource_update_freq_scale` function, which is defined in the same source code file and updates the given clock source with the new frequency. Let's look at the implementation of this function. In the first step we need to check the given frequency and, if it was not passed as `zero`, we need to calculate the `mult` and `shift` parameters for the given clock source. Why do we need to check the value of the `frequency`? Actually, it can be zero. If you looked attentively at the implementation of the `__clocksource_register` function, you may have noticed that we passed `frequency` as `0`. We do this only for clock sources that have self-defined `mult` and `shift` parameters. Look in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) and you will see the calculation of the `mult` and `shift` for `jiffies`. The `__clocksource_update_freq_scale` function will do it for us for other clock sources.
|
||||
First of all, we can see that the `__clocksource_register_scale` function starts with a call to the `__clocksource_update_freq_scale` function, which is defined in the same source code file and updates the given clock source with the new frequency. Let's look at the implementation of this function. In the first step we need to check the given frequency and, if it was not passed as `zero`, we need to calculate the `mult` and `shift` parameters for the given clock source. Why do we need to check the value of the `frequency`? Actually, it can be zero. If you looked attentively at the implementation of the `__clocksource_register` function, you may have noticed that we passed `frequency` as `0`. We do this only for clock sources that have self-defined `mult` and `shift` parameters. Look in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html) and you will see the calculation of the `mult` and `shift` for `jiffies`. The `__clocksource_update_freq_scale` function will do it for us for other clock sources.
|
||||
|
||||
So at the start of the `__clocksource_update_freq_scale` function we check the value of the `frequency` parameter and, if it is not zero, we need to calculate `mult` and `shift` for the given clock source. Let's look at the `mult` and `shift` calculation:
|
||||
|
||||
@ -448,4 +448,4 @@ Links
|
||||
* [clock rate](https://en.wikipedia.org/wiki/Clock_rate)
|
||||
* [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
|
||||
* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
|
||||
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html)
|
||||
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html)
|
@ -4,7 +4,7 @@ Timers and time management in the Linux kernel. Part 3.
|
||||
The tick broadcast framework and dyntick
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel and we stopped on the `clocksource` framework in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html). We have started to consider this framework because it is closely related to the special counters which are provided by the Linux kernel. One of these counters which we already saw in the first [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) of this chapter is - `jiffies`. As I already wrote in the first part of this chapter, we will consider time management related stuff step by step during the Linux kernel initialization. Previous step was call of the:
|
||||
This is third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel and we stopped on the `clocksource` framework in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-2.html). We have started to consider this framework because it is closely related to the special counters which are provided by the Linux kernel. One of these counters which we already saw in the first [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html) of this chapter is - `jiffies`. As I already wrote in the first part of this chapter, we will consider time management related stuff step by step during the Linux kernel initialization. Previous step was call of the:
|
||||
|
||||
```C
|
||||
register_refined_jiffies(CLOCK_TICK_RATE);
|
||||
@ -102,7 +102,7 @@ void __init tick_broadcast_init(void)
|
||||
}
|
||||
```
|
||||
|
||||
As we can see, the `tick_broadcast_init` function allocates different [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) with the help of the `zalloc_cpumask_var` function. The `zalloc_cpumask_var` function defined in the [lib/cpumask.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/lib/cpumask.c) source code file and expands to the call of the following function:
|
||||
As we can see, the `tick_broadcast_init` function allocates different [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) with the help of the `zalloc_cpumask_var` function. The `zalloc_cpumask_var` function defined in the [lib/cpumask.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/lib/cpumask.c) source code file and expands to the call of the following function:
|
||||
|
||||
```C
|
||||
bool zalloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
|
||||
@ -148,7 +148,7 @@ We have initialized six `cpumasks` in the `tick broadcast` framework, and now we
|
||||
The `tick broadcast` framework
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Hardware may provide some clock source devices. When a processor sleeps and its local timer stopped, there must be additional clock source device that will handle awakening of a processor. The Linux kernel uses these `special` clock source devices which can raise an interrupt at a specified time. We already know that such timers called `clock events` devices in the Linux kernel. Besides `clock events` devices. Actually, each processor in the system has its own local timer which is programmed to issue interrupt at the time of the next deferred task. Also these timers can be programmed to do a periodical job, like updating `jiffies` and etc. These timers represented by the `tick_device` structure in the Linux kernel. This structure defined in the [kernel/time/tick-sched.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/time/tick-sched.h) header file and looks:
|
||||
Hardware may provide some clock source devices. When a processor sleeps and its local timer stopped, there must be additional clock source device that will handle awakening of a processor. The Linux kernel uses these `special` clock source devices which can raise an interrupt at a specified time. We already know that such timers called `clock events` devices in the Linux kernel. Besides `clock events` devices, each processor in the system has its own local timer which is programmed to issue interrupt at the time of the next deferred task. Also these timers can be programmed to do a periodical job, like updating `jiffies` and etc. These timers represented by the `tick_device` structure in the Linux kernel. This structure defined in the [kernel/time/tick-sched.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/time/tick-sched.h) header file and looks:
|
||||
|
||||
```C
|
||||
struct tick_device {
|
||||
@ -407,7 +407,7 @@ for_each_cpu(cpu, tick_nohz_full_mask)
|
||||
context_tracking_cpu_set(cpu);
|
||||
```
|
||||
|
||||
The `context_tracking_cpu_set` function defined in the [kernel/context_tracking.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/context_tracking.c) source code file and main point of this function is to set the `context_tracking.active` [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable to `true`. When the `active` field will be set to `true` for the certain processor, all [context switches](https://en.wikipedia.org/wiki/Context_switch) will be ignored by the Linux kernel context tracking subsystem for this processor.
|
||||
The `context_tracking_cpu_set` function defined in the [kernel/context_tracking.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/context_tracking.c) source code file and main point of this function is to set the `context_tracking.active` [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variable to `true`. When the `active` field will be set to `true` for the certain processor, all [context switches](https://en.wikipedia.org/wiki/Context_switch) will be ignored by the Linux kernel context tracking subsystem for this processor.
|
||||
|
||||
That's all. This is the end of the `tick_nohz_init` function. After this `NO_HZ` related data structures will be initialized. We didn't see API of the `NO_HZ` mode, but will see it soon.
|
||||
|
||||
@ -433,12 +433,12 @@ Links
|
||||
* [CPU idle](https://en.wikipedia.org/wiki/Idle_%28CPU%29)
|
||||
* [power management](https://en.wikipedia.org/wiki/Power_management)
|
||||
* [NO_HZ documentation](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/timers/NO_HZ.txt)
|
||||
* [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
|
||||
* [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html)
|
||||
* [high precision event timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
|
||||
* [irq](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
|
||||
* [IPI](https://en.wikipedia.org/wiki/Inter-processor_interrupt)
|
||||
* [CPUID](https://en.wikipedia.org/wiki/CPUID)
|
||||
* [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
|
||||
* [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
|
||||
* [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html)
|
||||
* [context switches](https://en.wikipedia.org/wiki/Context_switch)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html)
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-2.html)
|
@ -4,7 +4,7 @@ Timers and time management in the Linux kernel. Part 4.
|
||||
Timers
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is fourth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel and in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html) we knew about the `tick broadcast` framework and `NO_HZ` mode in the Linux kernel. We will continue to dive into the time management related stuff in the Linux kernel in this part and will be acquainted with yet another concept in the Linux kernel - `timers`. Before we will look at timers in the Linux kernel, we have to learn some theory about this concept. Note that we will consider software timers in this part.
|
||||
This is fourth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel and in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-3.html) we knew about the `tick broadcast` framework and `NO_HZ` mode in the Linux kernel. We will continue to dive into the time management related stuff in the Linux kernel in this part and will be acquainted with yet another concept in the Linux kernel - `timers`. Before we will look at timers in the Linux kernel, we have to learn some theory about this concept. Note that we will consider software timers in this part.
|
||||
|
||||
The Linux kernel provides a `software timer` concept that allows kernel functions to be invoked at a future moment. Timers are widely used in the Linux kernel. For example, look in the [net/netfilter/ipset/ip_set_list_set.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/net/netfilter/ipset/ip_set_list_set.c) source code file. This source code file provides an implementation of the framework for managing groups of [IP](https://en.wikipedia.org/wiki/Internet_Protocol) addresses.
|
||||
|
||||
@ -45,7 +45,7 @@ Now let's continue to research source code of Linux kernel which is related to t
|
||||
Introduction to dynamic timers in the Linux kernel
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
As I already wrote, we knew about the `tick broadcast` framework and `NO_HZ` mode in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html). They will be initialized in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) source code file by the call of the `tick_init` function. If we will look at this source code file, we will see that the next time management related function is:
|
||||
As I already wrote, we knew about the `tick broadcast` framework and `NO_HZ` mode in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-3.html). They will be initialized in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) source code file by the call of the `tick_init` function. If we will look at this source code file, we will see that the next time management related function is:
|
||||
|
||||
```C
|
||||
init_timers();
|
||||
@ -75,7 +75,7 @@ static void __init init_timer_cpus(void)
|
||||
}
|
||||
```
|
||||
|
||||
If you do not know or do not remember what is it a `possible` cpu, you can read the special [part](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) of this book which describes `cpumask` concept in the Linux kernel. In short words, a `possible` processor is a processor which can be plugged in anytime during the life of the system.
|
||||
If you do not know or do not remember what is it a `possible` cpu, you can read the special [part](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) of this book which describes `cpumask` concept in the Linux kernel. In short words, a `possible` processor is a processor which can be plugged in anytime during the life of the system.
|
||||
|
||||
The `init_timer_cpu` function does the main work for us: it initializes the `tvec_base` structure for each processor. This structure is defined in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/time/timer.c) source code file and stores data related to a `dynamic` timer for a certain processor. Let's look at the definition of this structure:
|
||||
|
||||
@ -136,13 +136,13 @@ static void __init init_timer_cpu(int cpu)
|
||||
}
|
||||
```
|
||||
|
||||
The `tvec_bases` represents [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable which represents main data structure for a dynamic timer for a given processor. This `per-cpu` variable defined in the same source code file:
|
||||
The `tvec_bases` represents [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variable which represents main data structure for a dynamic timer for a given processor. This `per-cpu` variable defined in the same source code file:
|
||||
|
||||
```C
|
||||
static DEFINE_PER_CPU(struct tvec_base, tvec_bases);
|
||||
```
|
||||
|
||||
First of all we're getting the address of the `tvec_bases` for the given processor to `base` variable and as we got it, we are starting to initialize some of the `tvec_base` fields in the `init_timer_cpu` function. After initialization of the `per-cpu` dynamic timers with the [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) and the number of a possible processor, we need to initialize a `tstats_lookup_lock` [spinlock](https://en.wikipedia.org/wiki/Spinlock) in the `init_timer_stats` function:
|
||||
First of all we're getting the address of the `tvec_bases` for the given processor to `base` variable and as we got it, we are starting to initialize some of the `tvec_base` fields in the `init_timer_cpu` function. After initialization of the `per-cpu` dynamic timers with the [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html) and the number of a possible processor, we need to initialize a `tstats_lookup_lock` [spinlock](https://en.wikipedia.org/wiki/Spinlock) in the `init_timer_stats` function:
|
||||
|
||||
```C
|
||||
void __init init_timer_stats(void)
|
||||
@ -240,7 +240,7 @@ The last step in the `init_timers` function is the call of the:
|
||||
open_softirq(TIMER_SOFTIRQ, run_timer_softirq);
|
||||
```
|
||||
|
||||
function. The `open_softirq` function may be already familiar to you if you have read the ninth [part](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html) about the interrupts and interrupt handling in the Linux kernel. In short words, the `open_softirq` function defined in the [kernel/softirq.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/softirq.c) source code file and executes initialization of the deferred interrupt handler.
|
||||
function. The `open_softirq` function may be already familiar to you if you have read the ninth [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-9.html) about the interrupts and interrupt handling in the Linux kernel. In short words, the `open_softirq` function defined in the [kernel/softirq.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/softirq.c) source code file and executes initialization of the deferred interrupt handler.
|
||||
|
||||
In our case the deferred function is the `run_timer_softirq` function, which will be called after a hardware interrupt in the `do_IRQ` function, which is defined in the [arch/x86/kernel/irq.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/irq.c) source code file. The main point of this function is to handle a software dynamic timer. The Linux kernel does not do this during the hardware timer interrupt handling because it is a time-consuming operation.
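The registration and deferred execution described above can be sketched in user space as a small table of handlers indexed by softirq number. This is only an illustration under simplifying assumptions (no per-CPU state, no interrupt context); every `example_`-prefixed name is hypothetical and is not part of the kernel API.

```C
enum { EXAMPLE_TIMER_SOFTIRQ = 1, EXAMPLE_NR_SOFTIRQS = 4 };

struct example_softirq_action {
	void (*action)(struct example_softirq_action *);
};

/* One slot per softirq number, plus a bitmask of raised softirqs. */
static struct example_softirq_action example_softirq_vec[EXAMPLE_NR_SOFTIRQS];
static unsigned int example_pending;
static int example_timer_runs;

/* Registration only stores the handler; nothing runs yet. */
static void example_open_softirq(int nr,
		void (*action)(struct example_softirq_action *))
{
	example_softirq_vec[nr].action = action;
}

/* The "hard" interrupt path just marks the work as pending. */
static void example_raise_softirq(int nr)
{
	example_pending |= 1u << nr;
}

/* Later, outside the time-critical path, run every pending handler once. */
static void example_do_softirq(void)
{
	unsigned int pending = example_pending;

	example_pending = 0;
	for (int nr = 0; nr < EXAMPLE_NR_SOFTIRQS; nr++) {
		if ((pending & (1u << nr)) && example_softirq_vec[nr].action)
			example_softirq_vec[nr].action(&example_softirq_vec[nr]);
	}
}

/* Stand-in for the timer handler: it only counts how often it ran. */
static void example_run_timer_softirq(struct example_softirq_action *h)
{
	(void)h;
	example_timer_runs++;
}
```

The split between `example_raise_softirq` and `example_do_softirq` mirrors the reason given above: the expensive timer processing is postponed so that the interrupt path itself stays short.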
|
||||
|
||||
@ -256,7 +256,7 @@ static void run_timer_softirq(struct softirq_action *h)
|
||||
}
|
||||
```
|
||||
|
||||
At the beginning of the `run_timer_softirq` function we get a `dynamic` timer for a current processor and compares the current value of the [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) with the value of the `timer_jiffies` for the current structure by the call of the `time_after_eq` macro which is defined in the [include/linux/jiffies.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/jiffies.h) header file:
|
||||
At the beginning of the `run_timer_softirq` function we get a `dynamic` timer for a current processor and compares the current value of the [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html) with the value of the `timer_jiffies` for the current structure by the call of the `time_after_eq` macro which is defined in the [include/linux/jiffies.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/jiffies.h) header file:
|
||||
|
||||
```C
|
||||
#define time_after_eq(a,b) \
|
||||
@ -418,10 +418,10 @@ Links
|
||||
* [IP](https://en.wikipedia.org/wiki/Internet_Protocol)
|
||||
* [netfilter](https://en.wikipedia.org/wiki/Netfilter)
|
||||
* [network](https://en.wikipedia.org/wiki/Computer_network)
|
||||
* [cpumask](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
|
||||
* [cpumask](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html)
|
||||
* [interrupt](https://en.wikipedia.org/wiki/Interrupt)
|
||||
* [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html)
|
||||
* [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
|
||||
* [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html)
|
||||
* [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html)
|
||||
* [spinlock](https://en.wikipedia.org/wiki/Spinlock)
|
||||
* [procfs](https://en.wikipedia.org/wiki/Procfs)
|
||||
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html)
|
||||
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-3.html)
|
@ -4,7 +4,7 @@ Timers and time management in the Linux kernel. Part 5.
|
||||
Introduction to the `clockevents` framework
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is fifth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. As you might noted from the title of this part, the `clockevents` framework will be discussed. We already saw one framework in the [second](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) part of this chapter. It was `clocksource` framework. Both of these frameworks represent timekeeping abstractions in the Linux kernel.
|
||||
This is fifth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. As you might noted from the title of this part, the `clockevents` framework will be discussed. We already saw one framework in the [second](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-2.html) part of this chapter. It was `clocksource` framework. Both of these frameworks represent timekeeping abstractions in the Linux kernel.
|
||||
|
||||
First, let's refresh our memory and recall what the `clocksource` framework is and what its purpose is. The main goal of the `clocksource` framework is to provide a `timeline`. As described in the [documentation](https://github.com/0xAX/linux/blob/0a07b238e5f488b459b6113a62e06b6aab017f71/Documentation/timers/timekeeping.txt):
|
||||
|
||||
@ -130,7 +130,7 @@ The next two fields `shift` and `mult` are familiar to us. They will be used to
|
||||
#define cpumask_of(cpu) (get_cpu_mask(cpu))
|
||||
```
|
||||
|
||||
Where the `get_cpu_mask` returns the cpumask containing just a given `cpu` number. More about `cpumasks` concept you may read in the [CPU masks in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) part. In the last four lines of code we set callbacks for the clock event device suspend/resume, device shutdown and update of the clock event device state.
|
||||
Where the `get_cpu_mask` returns the cpumask containing just a given `cpu` number. More about `cpumasks` concept you may read in the [CPU masks in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) part. In the last four lines of code we set callbacks for the clock event device suspend/resume, device shutdown and update of the clock event device state.
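As a rough illustration of what "a cpumask containing just a given `cpu` number" means, the following user-space sketch builds such a mask as a bitmap with exactly one bit set. It is not the kernel's `get_cpu_mask` (which returns a pointer to a constant mask rather than building one by value); the type `example_cpumask`, the limit `EXAMPLE_NR_CPUS` and all helper names are assumptions made for this example.

```C
#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define EXAMPLE_NR_CPUS 128
#define EXAMPLE_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define EXAMPLE_MASK_LONGS \
	((EXAMPLE_NR_CPUS + EXAMPLE_BITS_PER_LONG - 1) / EXAMPLE_BITS_PER_LONG)

/* A bitmap with one bit per possible CPU number. */
struct example_cpumask {
	unsigned long bits[EXAMPLE_MASK_LONGS];
};

/* Build a mask in which only the bit for `cpu` is set. */
static struct example_cpumask example_cpumask_of(unsigned int cpu)
{
	struct example_cpumask mask;

	memset(&mask, 0, sizeof(mask));
	mask.bits[cpu / EXAMPLE_BITS_PER_LONG] =
		1UL << (cpu % EXAMPLE_BITS_PER_LONG);
	return mask;
}

/* Check whether the bit for `cpu` is set in the mask. */
static bool example_cpumask_test_cpu(unsigned int cpu,
		const struct example_cpumask *mask)
{
	return (mask->bits[cpu / EXAMPLE_BITS_PER_LONG] >>
		(cpu % EXAMPLE_BITS_PER_LONG)) & 1UL;
}

/* Count how many CPUs are present in the mask. */
static unsigned int example_cpumask_weight(const struct example_cpumask *mask)
{
	unsigned int count = 0;

	for (unsigned int cpu = 0; cpu < EXAMPLE_NR_CPUS; cpu++)
		count += example_cpumask_test_cpu(cpu, mask);
	return count;
}
```

A clock event device whose mask is built this way is bound to a single processor: the weight of the mask is always `1`, and only the chosen `cpu` tests as present.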
|
||||
|
||||
After we finished with the initialization of the `at91sam926x` periodic timer, we can register it by the call of the following functions:
|
||||
|
||||
@ -409,7 +409,7 @@ Links
|
||||
* [local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
|
||||
* [C3 state](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface#Device_states)
|
||||
* [Periodic Interval Timer (PIT) for at91sam926x](http://www.atmel.com/Images/doc6062.pdf)
|
||||
* [CPU masks in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
|
||||
* [CPU masks in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html)
|
||||
* [deadlock](https://en.wikipedia.org/wiki/Deadlock)
|
||||
* [CPU hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
|
||||
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html)
|
||||
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-3.html)
|
@ -4,7 +4,7 @@ Timers and time management in the Linux kernel. Part 6.
|
||||
x86_64 related clock sources
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is sixth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-5.html) we saw `clockevents` framework and now we will continue to dive into time management related stuff in the Linux kernel. This part will describe implementation of [x86](https://en.wikipedia.org/wiki/X86) architecture related clock sources (more about `clocksource` concept you can read in the [second part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) of this chapter).
|
||||
This is sixth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-5.html) we saw `clockevents` framework and now we will continue to dive into time management related stuff in the Linux kernel. This part will describe implementation of [x86](https://en.wikipedia.org/wiki/X86) architecture related clock sources (more about `clocksource` concept you can read in the [second part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-2.html) of this chapter).
|
||||
|
||||
First of all, we must know which clock sources may be used on the `x86` architecture. This is easy to find out from [sysfs](https://en.wikipedia.org/wiki/Sysfs), namely from the content of `/sys/devices/system/clocksource/clocksource0/available_clocksource`. The `/sys/devices/system/clocksource/clocksourceN` directory provides two special files to achieve this:
|
||||
|
||||
@ -31,7 +31,7 @@ $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
|
||||
tsc
|
||||
```
|
||||
|
||||
For me it is [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter). As we may know from the [second part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) of this chapter, which describes internals of the `clocksource` framework in the Linux kernel, the best clock source in a system is a clock source with the best (highest) rating or in other words with the highest [frequency](https://en.wikipedia.org/wiki/Frequency).
|
||||
For me it is [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter). As we may know from the [second part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-2.html) of this chapter, which describes internals of the `clocksource` framework in the Linux kernel, the best clock source in a system is a clock source with the best (highest) rating or in other words with the highest [frequency](https://en.wikipedia.org/wiki/Frequency).
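The selection rule mentioned above, "the clock source with the highest rating wins", can be sketched as a simple scan over the registered sources. This is a simplified user-space illustration, not the kernel's selection code; the `example_clocksource` structure and `example_find_best` function are hypothetical, and any rating values passed to it are up to the caller.

```C
#include <stddef.h>

/* Only the two fields needed for this illustration. */
struct example_clocksource {
	const char *name;
	int rating;
};

/*
 * Return the entry with the highest rating, or NULL for an empty list.
 * On a tie the earlier entry is kept.
 */
static const struct example_clocksource *
example_find_best(const struct example_clocksource *cs, size_t n)
{
	const struct example_clocksource *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!best || cs[i].rating > best->rating)
			best = &cs[i];
	}
	return best;
}
```

Given three entries such as `hpet` with rating `250`, `tsc` with rating `300` and `acpi_pm` with rating `200` (values used here purely as sample input), the function returns the `tsc` entry.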
|
||||
|
||||
The frequency of the [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) power management timer is `3.579545 MHz`. The frequency of the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) is at least `10 MHz`. The frequency of the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) depends on the processor. For example, on older processors the `Time Stamp Counter` counted internal processor clock cycles, which means its frequency changed when the processor's frequency scaling changed. The situation has changed for newer processors: they have an `invariant Time Stamp counter` that increments at a constant rate in all operational states of the processor. We can get its frequency from the output of `/proc/cpuinfo`. For example, for the first processor in the system:
|
||||
|
||||
@ -410,4 +410,4 @@ Links
|
||||
* [IRQ0](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29#Master_PIC)
|
||||
* [i8259](https://en.wikipedia.org/wiki/Intel_8259)
|
||||
* [initcall](http://www.compsoc.man.ac.uk/~moz/kernelnewbies/documents/initcall/kernel.html)
|
||||
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-5.html)
|
||||
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-5.html)
|
@ -4,7 +4,7 @@ Timers and time management in the Linux kernel. Part 7.
|
||||
Time related system calls in the Linux kernel
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the seventh and last part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-6.html), we discussed timers in the context of [x86_64](https://en.wikipedia.org/wiki/X86-64): [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) and [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter). Internal time management is an interesting part of the Linux kernel, but of course not only the kernel needs the `time` concept. Our programs also need to know time. In this part, we will consider implementation of some time management related [system calls](https://en.wikipedia.org/wiki/System_call). These system calls are:
|
||||
This is the seventh and last part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-6.html), we discussed timers in the context of [x86_64](https://en.wikipedia.org/wiki/X86-64): [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) and [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter). Internal time management is an interesting part of the Linux kernel, but of course not only the kernel needs the `time` concept. Our programs also need to know time. In this part, we will consider implementation of some time management related [system calls](https://en.wikipedia.org/wiki/System_call). These system calls are:
|
||||
|
||||
* `clock_gettime`;
|
||||
* `gettimeofday`;
|
||||
@ -57,7 +57,7 @@ The second parameter of the `gettimeofday` function is a pointer to the `timezon
|
||||
Current date/time: 03-26-2016/16:42:02
|
||||
```
|
||||
|
||||
As you may already know, a userspace application does not call a system call directly from the kernel space. Before the actual system call entry will be called, we call a function from the standard library. In my case it is [glibc](https://en.wikipedia.org/wiki/GNU_C_Library), so I will consider this case. The implementation of the `gettimeofday` function is located in the [sysdeps/unix/sysv/linux/x86/gettimeofday.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86/gettimeofday.c;h=36f7c26ffb0e818709d032c605fec8c4bd22a14e;hb=HEAD) source code file. As you already may know, the `gettimeofday` is not a usual system call. It is located in the special area which is called `vDSO` (you can read more about it in the [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html), which describes this concept).
|
||||
As you may already know, a userspace application does not call a system call directly from the kernel space. Before the actual system call entry will be called, we call a function from the standard library. In my case it is [glibc](https://en.wikipedia.org/wiki/GNU_C_Library), so I will consider this case. The implementation of the `gettimeofday` function is located in the [sysdeps/unix/sysv/linux/x86/gettimeofday.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86/gettimeofday.c;h=36f7c26ffb0e818709d032c605fec8c4bd22a14e;hb=HEAD) source code file. As you already may know, the `gettimeofday` is not a usual system call. It is located in the special area which is called `vDSO` (you can read more about it in the [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-3.html), which describes this concept).
|
||||
|
||||
The `glibc` implementation of `gettimeofday` tries to resolve the given symbol, in our case `__vdso_gettimeofday`, by calling the `_dl_vdso_vsym` internal function. If the symbol cannot be resolved, it returns `NULL` and we fall back to the usual system call:
|
||||
|
||||
@ -369,7 +369,7 @@ static inline bool timespec_valid(const struct timespec *ts)
|
||||
}
|
||||
```
|
||||
|
||||
which just checks that the given `timespec` does not represent date before `1970` and nanoseconds does not overflow `1` second. The `nanosleep` function ends with the call of the `hrtimer_nanosleep` function from the same source code file. The `hrtimer_nanosleep` function creates a [timer](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-4.html) and calls the `do_nanosleep` function. The `do_nanosleep` does main job for us. This function provides loop:
|
||||
which just checks that the given `timespec` does not represent date before `1970` and nanoseconds does not overflow `1` second. The `nanosleep` function ends with the call of the `hrtimer_nanosleep` function from the same source code file. The `hrtimer_nanosleep` function creates a [timer](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-4.html) and calls the `do_nanosleep` function. The `do_nanosleep` does main job for us. This function provides loop:
|
||||
|
||||
```C
|
||||
do {
|
||||
@ -412,10 +412,10 @@ Links
|
||||
* [register](https://en.wikipedia.org/wiki/Processor_register)
|
||||
* [System V Application Binary Interface](http://www.x86-64.org/documentation/abi.pdf)
|
||||
* [context switch](https://en.wikipedia.org/wiki/Context_switch)
|
||||
* [Introduction to timers in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-4.html)
|
||||
* [Introduction to timers in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-4.html)
|
||||
* [uptime](https://en.wikipedia.org/wiki/Uptime#Using_uptime)
|
||||
* [system calls table for x86_64](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/syscalls/syscall_64.tbl)
|
||||
* [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
|
||||
* [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
|
||||
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
|
||||
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-6.html)
|
||||
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-6.html)
|
@ -112,4 +112,9 @@ Thank you to all contributors:
|
||||
* [Andrés Rojas](https://github.com/c0r3dump3d)
|
||||
* [Beomsu Kim](https://github.com/0xF0D0)
|
||||
* [Firo Yang](https://github.com/firogh)
|
||||
* [Edward Hu](https://github.com/BDHU)
|
||||
* [Edward Hu](https://github.com/BDHU)
|
||||
* [WarpspeedSCP](https://github.com/WarpspeedSCP)
|
||||
* [Gabriela Moldovan](https://github.com/gabi-250)
|
||||
* [kuritonasu](https://github.com/kuritonasu/)
|
||||
* [Miles Frain](https://github.com/milesfrain)
|
||||
* [Horace Heaven](https://github.com/horaceheaven)
|
||||
|
@ -1,14 +0,0 @@
|
||||
# Interrupts and Interrupt Handling
|
||||
|
||||
In the following posts, we will cover interrupt and exception handling in the Linux kernel.
|
||||
|
||||
* [Interrupts and Interrupt Handling. Part 1.](interrupts-1.md) - describes interrupts and interrupt handling theory.
|
||||
* [Interrupts in the Linux Kernel](interrupts-2.md) - describes topics related to interrupt and exception handling from the early stage.
|
||||
* [Early interrupt handlers](interrupts-3.md) - describes early interrupt handlers.
|
||||
* [Interrupt handlers](interrupts-4.md) - describes first non-early interrupt handlers.
|
||||
* [Implementation of exception handlers](interrupts-5.md) - describes implementation of some exception handlers such as double fault, divide by zero etc.
|
||||
* [Handling non-maskable interrupts](interrupts-6.md) - describes handling of non-maskable interrupts and remaining interrupt handlers from the architecture-specific part.
|
||||
* [External hardware interrupts](interrupts-7.md) - describes early initialization of code which is related to handling external hardware interrupts.
|
||||
* [Non-early initialization of the IRQs](interrupts-8.md) - describes non-early initialization of code which is related to handling external hardware interrupts.
|
||||
* [Softirq, Tasklets and Workqueues](interrupts-9.md) - describes softirqs, tasklets and workqueues concepts.
|
||||
* [Last part](interrupts-10.md) - this is the last part of the `Interrupts and Interrupt Handling` chapter; here we will see a real hardware driver and some interrupt-related topics.
|