Merge branch 'master' of github.com:0xAX/linux-insides

pull/133/head
Alexander Kuleshov 9 years ago
commit 280f2ee16e

@ -135,7 +135,7 @@ We will see:
In this example we can see that this code will be executed in 16 bit real mode and will start at 0x7c00 in memory. After the start it calls the [0x10](http://www.ctyme.com/intr/rb-0106.htm) interrupt which just prints `!` symbol. It fills rest of 510 bytes with zeros and finish with two magic bytes `0xaa` and `0x55`.
Although you can see binary dump of it with `objdump` util:
You can see binary dump of it with `objdump` util:
```
nasm -f bin boot.nasm
@ -252,10 +252,6 @@ Start of Kernel Setup
Finally we are in the kernel. Technically kernel didn't run yet, first of all we need to setup kernel, memory manager, process manager etc. Kernel setup execution starts from [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) at the [_start](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L293). It is a little strange at the first look, there are many instructions before it.
=======
Finally we are in the kernel. Technically kernel didn't run yet, first of all we need to setup kernel, memory manager, process manager, etc. Kernel setup execution starts from [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) at the [_start](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L293). It is little strange at the first look, there are many instructions before it. Actually....
Actually Long time ago Linux kernel had its own bootloader, but now if you run for example:
```

@ -313,9 +313,9 @@ After that `biosregs` structure is filled with `memset`, `bios_putchar` calls th
Heap initialization
--------------------------------------------------------------------------------
After the stack and bss section were prepared in [header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) (see previous [part](linux-bootstrap-1.md)), the kernel needs to initialize the [heap](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L116) with the [init_heap](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L116) function.
After the stack and bss section were prepared in [header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) (see previous [part](linux-bootstrap-1.md)), the kernel needs to initialize the [heap](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L116) with the [`init_heap`](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L116) function.
First of all `init_heap` checks the [`CAN_USE_HEAP`](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/bootparam.h#L21) flag from the [`loadflags`](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L321) kernel setup header and calculates the end of the stack if this flag was set:
First of all `init_heap` checks the [`CAN_USE_HEAP`](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/bootparam.h#L21) flag from the [`loadflags`](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L321) in the kernel setup header and calculates the end of the stack if this flag was set:
```C
char *stack_end;
@ -327,9 +327,13 @@ First of all `init_heap` checks the [`CAN_USE_HEAP`](https://github.com/torvalds
or in other words `stack_end = esp - STACK_SIZE`.
Then there is the `heap_end` calculation which is `heap_end_ptr` or `_end` + 512 and a check if `heap_end` is greater than `stack_end` makes it equal.
Then there is the `heap_end` calculation:
```c
heap_end = (char *)((size_t)boot_params.hdr.heap_end_ptr + 0x200);
```
which means `heap_end_ptr` or `_end` + `512`(`0x200h`). And at the last is checked that whether `heap_end` is greater than `stack_end`. If it is then `stack_end` is assigned to `heap_end` to make them equal.
From this moment we can use the heap in the kernel setup code. We will see how to use it and how the API for it is implemented in the next posts.
Now the heap is initialized and we can use it using the `GET_HEAP` method. We will see how it is used, how to use it and how the it is implemented in the next posts.
CPU validation
--------------------------------------------------------------------------------
@ -344,12 +348,14 @@ check_cpu(&cpu_level, &req_level, &err_flags);
return -1;
}
```
It checks the cpu's flags, presence of [long mode](http://en.wikipedia.org/wiki/Long_mode) (which we will see in more detail in the next sections) in case of x86_64(64-bit) CPU, checks the processor's vendor and makes preparation for certain vendors like turning off SSE+SSE2 for AMD if they are missing, etc.
`check_cpu` checks the cpu's flags, presence of [long mode](http://en.wikipedia.org/wiki/Long_mode) in case of x86_64(64-bit) CPU, checks the processor's vendor and makes preparation for certain vendors like turning off SSE+SSE2 for AMD if they are missing, etc.
Memory detection
--------------------------------------------------------------------------------
The next step is memory detection by the `detect_memory` function. It uses different programming interfaces for memory detection like `0xe820`, `0xe801` and `0x88`. We will see only the implementation of 0xE820 here. Let's look into the `detect_memory_e820` implementation from the [arch/x86/boot/memory.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/memory.c) source file. First of all, the `detect_memory_e820` function initializes the `biosregs` structure as we saw above and fills registers with special values for the `0xe820` call:
The next step is memory detection by the `detect_memory` function. `detect_memory` basically provides a map of available RAM to the cpu. It uses different programming interfaces for memory detection like `0xe820`, `0xe801` and `0x88`. We will see only the implementation of **0xE820** here.
Let's look into the `detect_memory_e820` implementation from the [arch/x86/boot/memory.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/memory.c) source file. First of all, the `detect_memory_e820` function initializes the `biosregs` structure as we saw above and fills registers with special values for the `0xe820` call:
```assembly
initregs(&ireg);
@ -359,9 +365,13 @@ The next step is memory detection by the `detect_memory` function. It uses diffe
ireg.di = (size_t)&buf;
```
The `ax` register must contain the number of the function (0xe820 in our case), `cx` register contains size of the buffer which will contain data about memory, `edx` must contain the `SMAP` magic number, `es:di` must contain the address of the buffer which will contain memory data and `ebx` has to be zero.
* `ax` contains the number of the function (0xe820 in our case)
* `cx` register contains size of the buffer which will contain data about memory
* `edx` must contain the `SMAP` magic number
* `es:di` must contain the address of the buffer which will contain memory data
* `ebx` has to be zero.
Next is a loop where data about the memory will be collected. It starts from the call of the 0x15 bios interrupt, which writes one line from the address allocation table. For getting the next line we need to call this interrupt again (which we do in the loop). Before the next call `ebx` must contain the value returned previously:
Next is a loop where data about the memory will be collected. It starts from the call of the `0x15` BIOS interrupt, which writes one line from the address allocation table. For getting the next line we need to call this interrupt again (which we do in the loop). Before the next call `ebx` must contain the value returned previously:
```C
intcall(0x15, &ireg, &oreg);
@ -389,7 +399,18 @@ You can see the result of this in the `dmesg` output, something like:
Keyboard initialization
--------------------------------------------------------------------------------
The next step is the initialization of the keyboard with the call of the [`keyboard_init()`](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L65) function. At first `keyboard_init` initializes registers using the `initregs` function and calling the [0x16](http://www.ctyme.com/intr/rb-1756.htm) interrupt for getting the keyboard status. After this it calls [0x16](http://www.ctyme.com/intr/rb-1757.htm) again to set repeat rate and delay.
The next step is the initialization of the keyboard with the call of the [`keyboard_init()`](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c#L65) function. At first `keyboard_init` initializes registers using the `initregs` function and calling the [0x16](http://www.ctyme.com/intr/rb-1756.htm) interrupt for getting the keyboard status.
```c
initregs(&ireg);
ireg.ah = 0x02; /* Get keyboard status */
intcall(0x16, &ireg, &oreg);
boot_params.kbd_status = oreg.al;
```
After this it calls [0x16](http://www.ctyme.com/intr/rb-1757.htm) again to set repeat rate and delay.
```c
ireg.ax = 0x0305; /* Set keyboard repeat rate */
intcall(0x16, &ireg, NULL);
```
Querying
--------------------------------------------------------------------------------
@ -422,7 +443,7 @@ int query_mca(void)
}
```
It fills the `ah` register with `0xc0` and calls the `0x15` BIOS interruption. After the interrupt execution it checks the [carry flag](http://en.wikipedia.org/wiki/Carry_flag) and if it is set to 1, the BIOS doesn't support `MCA`. If carry flag is set to 0, `ES:BX` will contain a pointer to the system information table, which looks like this:
It fills the `ah` register with `0xc0` and calls the `0x15` BIOS interruption. After the interrupt execution it checks the [carry flag](http://en.wikipedia.org/wiki/Carry_flag) and if it is set to 1, the BIOS doesn't support (**MCA**)[https://en.wikipedia.org/wiki/Micro_Channel_architecture]. If carry flag is set to 0, `ES:BX` will contain a pointer to the system information table, which looks like this:
```
Offset Size Description )
@ -460,15 +481,15 @@ static inline void set_fs(u16 seg)
}
```
There is inline assembly which gets the value of the `seg` parameter and puts it into the `fs` register. There are many functions in [boot.h](https://github.com/torvalds/linux/blob/master/arch/x86/boot/boot.h) like `set_fs`, for example `set_gs`, `fs`, `gs` for reading a value in it etc...
This function contains inline assembly which gets the value of the `seg` parameter and puts it into the `fs` register. There are many functions in [boot.h](https://github.com/torvalds/linux/blob/master/arch/x86/boot/boot.h) like `set_fs`, for example `set_gs`, `fs`, `gs` for reading a value in it etc...
At the end of `query_mca` it just copies the table which pointed to by `es:bx` to the `boot_params.sys_desc_table`.
The next step is getting [Intel SpeedStep](http://en.wikipedia.org/wiki/SpeedStep) information by calling the `query_ist` function. First of all it checks the CPU level and if it is correct, calls `0x15` for getting info and saves the result to `boot_params`.
The following [query_apm_bios](https://github.com/torvalds/linux/blob/master/arch/x86/boot/apm.c#L21) function gets [Advanced Power Management](http://en.wikipedia.org/wiki/Advanced_Power_Management) information from the BIOS. `query_apm_bios` calls the `0x15` BIOS interruption too, but with `ah` - `0x53` to check `APM` installation. After the `0x15` execution, `query_apm_bios` functions checks `PM` signature (it must be `0x504d`), carry flag (it must be 0 if `APM` supported) and value of the `cx` register (if it's 0x02, protected mode interface is supported).
The following [query_apm_bios](https://github.com/torvalds/linux/blob/master/arch/x86/boot/apm.c#L21) function gets [Advanced Power Management](http://en.wikipedia.org/wiki/Advanced_Power_Management) information from the BIOS. `query_apm_bios` calls the `0x15` BIOS interruption too, but with `ah` = `0x53` to check `APM` installation. After the `0x15` execution, `query_apm_bios` functions checks `PM` signature (it must be `0x504d`), carry flag (it must be 0 if `APM` supported) and value of the `cx` register (if it's 0x02, protected mode interface is supported).
Next it calls the `0x15` again, but with `ax = 0x5304` for disconnecting the `APM` interface and connect the 32bit protected mode interface. In the end it fills `boot_params.apm_bios_info` with values obtained from the BIOS.
Next it calls the `0x15` again, but with `ax = 0x5304` for disconnecting the `APM` interface and connecting the 32-bit protected mode interface. In the end it fills `boot_params.apm_bios_info` with values obtained from the BIOS.
Note that `query_apm_bios` will be executed only if `CONFIG_APM` or `CONFIG_APM_MODULE` was set in configuration file:
@ -478,7 +499,7 @@ Note that `query_apm_bios` will be executed only if `CONFIG_APM` or `CONFIG_APM_
#endif
```
The last is the [`query_edd`](https://github.com/torvalds/linux/blob/master/arch/x86/boot/edd.c#L122) function, which asks `Enhanced Disk Drive` information from the BIOS. Let's look into the `query_edd` implementation.
The last is the [`query_edd`](https://github.com/torvalds/linux/blob/master/arch/x86/boot/edd.c#L122) function, which queries `Enhanced Disk Drive` information from the BIOS. Let's look into the `query_edd` implementation.
First of all it reads the [edd](https://github.com/torvalds/linux/blob/master/Documentation/kernel-parameters.txt#L1023) option from kernel's command line and if it was set to `off` then `query_edd` just returns.
@ -496,7 +517,7 @@ If EDD is enabled, `query_edd` goes over BIOS-supported hard disks and queries E
...
```
where `0x80` is the first hard drive and the `EDD_MBR_SIG_MAX` macro is 16. It collects data into the array of [edd_info](https://github.com/torvalds/linux/blob/master/include/uapi/linux/edd.h#L172) structures. `get_edd_info` checks that EDD is present by invoking the `0x13` interrupt with `ah` as `0x41` and if EDD is present, `get_edd_info` again calls the `0x13` interrupt, but with `ah` as `0x48` and `si` containing the address of the buffer where EDD information will be stored.
where `0x80` is the first hard drive and the value of `EDD_MBR_SIG_MAX` macro is 16. It collects data into the array of [edd_info](https://github.com/torvalds/linux/blob/master/include/uapi/linux/edd.h#L172) structures. `get_edd_info` checks that EDD is present by invoking the `0x13` interrupt with `ah` as `0x41` and if EDD is present, `get_edd_info` again calls the `0x13` interrupt, but with `ah` as `0x48` and `si` containing the address of the buffer where EDD information will be stored.
Conclusion
--------------------------------------------------------------------------------

@ -72,7 +72,7 @@ Now we know default physical and virtual addresses of the `startup_64` routine,
subq $_text - __START_KERNEL_map, %rbp
```
Here we just put the `rip-relative` address to the `rbp` register and than subtract `$_text - __START_KERNEL_map` from it. We know that compiled address of the `_text` is `0xffffffff81000000` and `__START_KERNEL_map` contains `0xffffffff81000000`, so `rbp` will contain physical address of the `text` - `0x1000000` after this calculation. We need to calculate it because kernel can be runned not on the default address, but now we know actual physical address.
Here we just put the `rip-relative` address to the `rbp` register and then subtract `$_text - __START_KERNEL_map` from it. We know that compiled address of the `_text` is `0xffffffff81000000` and `__START_KERNEL_map` contains `0xffffffff81000000`, so `rbp` will contain physical address of the `text` - `0x1000000` after this calculation. We need to calculate it because kernel can't be run on the default address, but now we know the actual physical address.
In the next step we checks that this address is aligned with:
@ -122,7 +122,7 @@ The first step before we started to setup identity paging, need to correct follo
addq %rbp, level2_fixmap_pgt + (506*8)(%rip)
```
Here we need to correct `early_level4_pgt` and other addresses of the page table directories, because as I wrote above, kernel can be runned not at the default `0x1000000` address. `rbp` register contains actuall address so we add to the `early_level4_pgt`, `level3_kernel_pgt` and `level2_fixmap_pgt`. Let's try to understand what this labels means. First of all let's look on their definition:
Here we need to correct `early_level4_pgt` and other addresses of the page table directories, because as I wrote above, kernel can't be run at the default `0x1000000` address. `rbp` register contains actual address so we add to the `early_level4_pgt`, `level3_kernel_pgt` and `level2_fixmap_pgt`. Let's try to understand what these labels means. First of all let's look on their definition:
```assembly
NEXT_PAGE(early_level4_pgt)
@ -385,7 +385,7 @@ INIT_PER_CPU(gdt_page);
As we got `init_per_cpu__gdt_page` in `INIT_PER_CPU_VAR` and `INIT_PER_CPU` macro from linker script will be expanded we will get offset from the `__per_cpu_load`. After this calculations, we will have correct base address of the new GDT.
Generally per-CPU variables is a 2.6 kernel feature. You can understand what is it from it's name. When we create `per-CPU` variable, each CPU will have will have it's own copy of this variable. Here we creating `gdt_page` per-CPU variable. There are many advantages for variables of this type, like there are no locks, because each CPU works with it's own copy of variable and etc... So every core on multiprocessor will have it's own `GDT` table and every entry in the table will represent a memory segment which can be accessed from the thread which runned on the core. You can read in details about `per-CPU` variables in the [Theory/per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) post.
Generally per-CPU variables is a 2.6 kernel feature. You can understand what is it from it's name. When we create `per-CPU` variable, each CPU will have will have it's own copy of this variable. Here we creating `gdt_page` per-CPU variable. There are many advantages for variables of this type, like there are no locks, because each CPU works with it's own copy of variable and etc... So every core on multiprocessor will have it's own `GDT` table and every entry in the table will represent a memory segment which can be accessed from the thread which ran on the core. You can read in details about `per-CPU` variables in the [Theory/per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) post.
As we loaded new Global Descriptor Table, we reload segments as we did it every time:

@ -230,9 +230,9 @@ and a couple of directories depends on the different configuration options:
In the end of the `proc_root_init` we call the `proc_sys_init` function which creates `/proc/sys` directory and initializes the [Sysctl](http://en.wikipedia.org/wiki/Sysctl).
It is the end of `start_kernel` function. I did not describe all functions which are called in the `start_kernel`. I missed it, because they are not so improtant for the generic kernel initialization stuff and depend on only different kernel configurations. They are `taskstats_init_early` which exports per-task statistic to the user-space, `delayacct_init` - initializes per-task delay accounting, `key_init` and `security_init` initialize diferent security stuff, `check_bugs` - makes fix up of the some architecture-dependent bugs, `ftrace_init` function executes initialization of the [ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt), `cgroup_init` makes initialization of the rest of the [cgroup](http://en.wikipedia.org/wiki/Cgroups) subsystem and etc... Many of these parts and subsystems will be described in the other chapters.
It is the end of `start_kernel` function. I did not describe all functions which are called in the `start_kernel`. I missed it, because they are not so important for the generic kernel initialization stuff and depend on only different kernel configurations. They are `taskstats_init_early` which exports per-task statistic to the user-space, `delayacct_init` - initializes per-task delay accounting, `key_init` and `security_init` initialize diferent security stuff, `check_bugs` - makes fix up of the some architecture-dependent bugs, `ftrace_init` function executes initialization of the [ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt), `cgroup_init` makes initialization of the rest of the [cgroup](http://en.wikipedia.org/wiki/Cgroups) subsystem and etc... Many of these parts and subsystems will be described in the other chapters.
That's all. Finally we passed through the long-long `start_kernel` function. But it is not the end of the linux kernel initialization process. We have no runned first process yet. In the end of the `start_kernel` we can see the last call of the - `rest_init` function. Let's go ahead.
That's all. Finally we passed through the long-long `start_kernel` function. But it is not the end of the linux kernel initialization process. We haven't run the first process yet. In the end of the `start_kernel` we can see the last call of the - `rest_init` function. Let's go ahead.
First steps after the start_kernel
--------------------------------------------------------------------------------
@ -314,7 +314,7 @@ void init_idle_bootup_task(struct task_struct *idle)
}
```
where `idle` class is a low priority tasks and tasks can be runned only when the processor has not to run anything besides this tasks. The second function `schedule_preempt_disabled` disables preempt in `idle` tasks. And the third function `cpu_startup_entry` defined in the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/master/sched/idle.c) and calls `cpu_idle_loop` from the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/master/sched/idle.c). The `cpu_idle_loop` function works as process with `PID = 0` and works in the background. Main purpose of the `cpu_idle_loop` is usage of the idle CPU cycles. When there are no one process to run, this process starts to work. We have one process with `idle` scheduling class (we just set the `current` task to the `idle` with the call of the `init_idle_bootup_task` function), so the `idle` thread does not do useful work and checks that there is not active task to switch:
where `idle` class is a low priority tasks and tasks can be run only when the processor doesn't have to run anything besides this tasks. The second function `schedule_preempt_disabled` disables preempt in `idle` tasks. And the third function `cpu_startup_entry` defined in the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/master/sched/idle.c) and calls `cpu_idle_loop` from the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/master/sched/idle.c). The `cpu_idle_loop` function works as process with `PID = 0` and works in the background. Main purpose of the `cpu_idle_loop` is usage of the idle CPU cycles. When there are no one process to run, this process starts to work. We have one process with `idle` scheduling class (we just set the `current` task to the `idle` with the call of the `init_idle_bootup_task` function), so the `idle` thread does not do useful work and checks that there is not active task to switch:
```C
static void cpu_idle_loop(void)

@ -441,7 +441,7 @@ init_idle(current, smp_processor_id());
calc_load_update = jiffies + LOAD_FREQ;
```
So, the `init` process will be runned, when there will no other candidates (as it first process in the system). In the end we just set `scheduler_running` variable:
So, the `init` process will be run, when there will be no other candidates (as it is the first process in the system). In the end we just set `scheduler_running` variable:
```C
scheduler_running = 1;

@ -24,6 +24,8 @@
* [Initialization of non-early interrupt gates](interrupts/interrupts-4.md)
* [Implementation of some exception handlers](interrupts/interrupts-5.md)
* [Handling Non-Maskable interrupts](interrupts/interrupts-6.md)
* [Dive into external hardware interrupts](interrupts/interrupts-7.md)
* [Initialization of external hardware interrupts structures](interrupts/interrupts-8.md)
* [Memory management](mm/README.md)
* [Memblock](mm/linux-mm-1.md)
* [Fixmaps and ioremap](mm/linux-mm-2.md)
@ -43,7 +45,7 @@
* [Initial ram disk]()
* [initrd]()
* [Misc](Misc/README.md)
* [Kernel building and instalation]()
* [How kernel compiled]()
* [Write and Submit your first Linux kernel Patch]()
* [Data types in the kernel]()
* [Useful links](LINKS.md)

@ -63,3 +63,4 @@ Thank you to all contributors:
* [Adam Shannon](https://github.com/adamdecaf)
* [Donny Nadolny](https://github.com/dnadolny)
* [Ehsun N](https://github.com/imehsunn)
* [Waqar Ahmed](https://github.com/Waqar144)

@ -8,3 +8,5 @@ You will find a couple of posts which describes an interrupts and an exceptions
* [Interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-4.md) - fourth part describes first non-early interrupt handlers.
* [Implementation of exception handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-5.md) - descripbes implementation of some exception handlers as double fault, divide by zero and etc.
* [Handling Non-Maskable interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-6.md) - describes handling of non-maskable interrupts and the rest of interrupts handlers from the architecture-specific part.
* [Dive into external hardware interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-7.md) - this part describes early initialization of code which is related to handling of external hardware interrupts.
* [Non-early initialization of the IRQs](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-8.md) - this part describes non-early initialization of code which is related to handling of external hardware interrupts.

@ -4,9 +4,9 @@ Interrupts and Interrupt Handling. Part 6.
Non-maskable interrupt handler
--------------------------------------------------------------------------------
It is sixth part of the [Interrupts and Interrupt Handling in the Linux kernel](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-5.html) we saw implementation of some exception handlers for the [General Protection Fault](https://en.wikipedia.org/wiki/General_protection_fault) exception, divide excetpion, invalid [opcode](https://en.wikipedia.org/wiki/Opcode) excetpion and etc. As I wrote in the previous part we will see implementations of the rest excetpions in this part. We will see implementation of the following handlers:
It is sixth part of the [Interrupts and Interrupt Handling in the Linux kernel](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-5.html) we saw implementation of some exception handlers for the [General Protection Fault](https://en.wikipedia.org/wiki/General_protection_fault) exception, divide exception, invalid [opcode](https://en.wikipedia.org/wiki/Opcode) exceptions and etc. As I wrote in the previous part we will see implementations of the rest exceptions in this part. We will see implementation of the following handlers:
* [Non-Masksable](https://en.wikipedia.org/wiki/Non-maskable_interrupt) interrupt;
* [Non-Maskable](https://en.wikipedia.org/wiki/Non-maskable_interrupt) interrupt;
* [BOUND](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm) Range Exceeded Exception;
* [Coprocessor](https://en.wikipedia.org/wiki/Coprocessor) exception;
* [SIMD](https://en.wikipedia.org/wiki/SIMD) coprocessor exception.
@ -16,7 +16,7 @@ in this part. So, let's start.
Non-Maskable interrupt handling
--------------------------------------------------------------------------------
A [Non-Masksable](https://en.wikipedia.org/wiki/Non-maskable_interrupt) interrupt is a hardware interrupt that cannot be ignore by standard masking techniques. In a general way, a non-maskable interrupt can be generated in either of two ways:
A [Non-Maskable](https://en.wikipedia.org/wiki/Non-maskable_interrupt) interrupt is a hardware interrupt that cannot be ignore by standard masking techniques. In a general way, a non-maskable interrupt can be generated in either of two ways:
* External hardware asserts the non-maskable interrupt [pin](https://en.wikipedia.org/wiki/CPU_socket) on the CPU.
* The processor receives a message on the system bus or the APIC serial bus with a delivery mode `NMI`.
@ -49,7 +49,7 @@ ENTRY(nmi)
END(nmi)
```
in the same [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file. Let's dive into it and will try to understand how `Non-Maskable` interrupt handler works. The `nmi` handlers starts from the call of the:
in the same [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file. Lets dive into it and will try to understand how `Non-Maskable` interrupt handler works. The `nmi` handlers starts from the call of the:
```assembly
PARAVIRT_ADJUST_EXCEPTION_FRAME
@ -123,7 +123,7 @@ pushq 11*8(%rsp)
.endr
```
with the [.rept](http://tigcc.ticalc.org/doc/gnuasm.html#SEC116) assembly directive. We need in the copy of the original stack frame. Generally we need in two copies of the interrupt stack. First is `copied` interrupts stack: `saved` stack frame and `copied` stack frame. Now we pushes original stack frame to the `saved` stack frame which locates after the just allocated `40` bytes (`copied` stack frame). This stack frame is used to fixup the `copied` stack frame that a nested NMI may change. The second - `copied` stack frame modifed by any nested `NMIs` to let the first `NMI` know that we triggered a second `NMI` and we shoult rebepat the first `NMI` handler. Ok, we have made first copy of the original stack frame, now time to make second copy:
with the [.rept](http://tigcc.ticalc.org/doc/gnuasm.html#SEC116) assembly directive. We need in the copy of the original stack frame. Generally we need in two copies of the interrupt stack. First is `copied` interrupts stack: `saved` stack frame and `copied` stack frame. Now we pushes original stack frame to the `saved` stack frame which locates after the just allocated `40` bytes (`copied` stack frame). This stack frame is used to fixup the `copied` stack frame that a nested NMI may change. The second - `copied` stack frame modified by any nested `NMIs` to let the first `NMI` know that we triggered a second `NMI` and we should repeat the first `NMI` handler. Ok, we have made first copy of the original stack frame, now time to make second copy:
```assembly
addq $(10*8), %rsp
@ -162,7 +162,7 @@ After all of these manipulations our stack frame will be like this:
+-------------------------+
```
After this we push dummy error code on the stack as we did it already in the previous exception handlers and allocate space for the general purpose regiseters on the stack:
After this we push dummy error code on the stack as we did it already in the previous exception handlers and allocate space for the general purpose registers on the stack:
```assembly
pushq $-1
@ -183,7 +183,7 @@ After space allocation for the general registers we can see call of the `paranoi
call paranoid_entry
```
We can remember from the previous parts this label. It pushes general purpose regisrers on the stack, reads `MSR_GS_BASE` [Model Specific regiser](https://en.wikipedia.org/wiki/Model-specific_register) and checks its value. If the value of the `MSR_GS_BASE` is negative, we came from the kernel mode and just return from the `paranoid_entry`, in other way it means that we came from the usermode and need to execeute `swapgs` instruction which will change user `gs` with the kernel `gs`:
We can remember from the previous parts this label. It pushes general purpose registers on the stack, reads `MSR_GS_BASE` [Model Specific register](https://en.wikipedia.org/wiki/Model-specific_register) and checks its value. If the value of the `MSR_GS_BASE` is negative, we came from the kernel mode and just return from the `paranoid_entry`, in other way it means that we came from the usermode and need to execute `swapgs` instruction which will change user `gs` with the kernel `gs`:
```assembly
ENTRY(paranoid_entry)
@ -201,7 +201,7 @@ ENTRY(paranoid_entry)
END(paranoid_entry)
```
Note that after the `swapgs` instruction we zeroed the `ebx` register. Next time we will check content of this register and if we executed `swapgs` than `ebx` must contain `0` and `1` in other way. In the next step we store value of the `cr2` [control register](https://en.wikipedia.org/wiki/Control_register) to the `r12` register, because the `NMI` handler can cause `page fault` and corrup the value of this control register:
Note that after the `swapgs` instruction we zeroed the `ebx` register. Next time we will check content of this register and if we executed `swapgs` than `ebx` must contain `0` and `1` in other way. In the next step we store value of the `cr2` [control register](https://en.wikipedia.org/wiki/Control_register) to the `r12` register, because the `NMI` handler can cause `page fault` and corrupt the value of this control register:
```C
movq %cr2, %r12
@ -215,7 +215,7 @@ movq $-1, %rsi
call do_nmi
```
We will back to the `do_nmi` little later in this part, but now let's look what occurs after the `do_nmi` will finish its exceution. After the `do_nmi` handler will be finished we check the `cr2` register, because we can got page fault during `do_nmi` performed and if we got it we restore original `cr2`, in other way we jump on the label `1`. After this we test content of the `ebx` register (remember it must contain `0` if we have used `swapgs` instruction and `1` if we didn't use it) and execute `SWAPGS_UNSAFE_STACK` if it contains `1` or jump to the `nmi_restore` label. The `SWAPGS_UNSAFE_STACK` macro just expands to the `swapgs` instruction. In the `nmi_restore` label we restore general purpose registers, clear allocated space on the stack for this registers clear our temporary variable and exit from the interrupt handler with the `INTERRUPT_RETURN` macro:
We will back to the `do_nmi` little later in this part, but now let's look what occurs after the `do_nmi` will finish its execution. After the `do_nmi` handler will be finished we check the `cr2` register, because we can got page fault during `do_nmi` performed and if we got it we restore original `cr2`, in other way we jump on the label `1`. After this we test content of the `ebx` register (remember it must contain `0` if we have used `swapgs` instruction and `1` if we didn't use it) and execute `SWAPGS_UNSAFE_STACK` if it contains `1` or jump to the `nmi_restore` label. The `SWAPGS_UNSAFE_STACK` macro just expands to the `swapgs` instruction. In the `nmi_restore` label we restore general purpose registers, clear allocated space on the stack for this registers clear our temporary variable and exit from the interrupt handler with the `INTERRUPT_RETURN` macro:
```assembly
movq %cr2, %rcx
@ -239,14 +239,14 @@ nmi_restore:
where `INTERRUPT_RETURN` is defined in the [arch/x86/include/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/irqflags.h) and just expands to the `iret` instruction. That's all.
Now let's consider case when another `NMI` interrupt occured when previous `NMI` interrupt didn't finish its execution. You can remember from the beginnig of this part that we've made a check that we came from userspace and jump on the `first_nmi` in this case:
Now let's consider case when another `NMI` interrupt occurred when previous `NMI` interrupt didn't finish its execution. You can remember from the beginning of this part that we've made a check that we came from userspace and jump on the `first_nmi` in this case:
```assembly
cmpl $__KERNEL_CS, 16(%rsp)
jne first_nmi
```
Note that in this case it is first `NMI` everytime, because if the first `NMI` catched page fault, breakpoint or another exception it will be executed in the kernel mode. If we didn't come from userspace, first of all we test our temporary variable:
Note that in this case it is first `NMI` every time, because if the first `NMI` catched page fault, breakpoint or another exception it will be executed in the kernel mode. If we didn't come from userspace, first of all we test our temporary variable:
```assembly
cmpl $1, -8(%rsp)
@ -310,7 +310,7 @@ That's all.
Range Exceeded Exception
--------------------------------------------------------------------------------
The next exception is the `BOUND` range exceeded exception. The `BOUND` instruction determines if the first operand (array index) is within the bounds of an array specified the second operand (bounds operand). If the index is not within bounds, a `BOUND` range exceeded exception or `#BR` is occured. The handler of the `#BR` exception is the `do_bounds` function that defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). The `do_bounds` handler starts with the call of the `exception_enter` function and ends with the call of the `exception_exit`:
The next exception is the `BOUND` range exceeded exception. The `BOUND` instruction determines if the first operand (array index) is within the bounds of an array specified the second operand (bounds operand). If the index is not within bounds, a `BOUND` range exceeded exception or `#BR` is occurred. The handler of the `#BR` exception is the `do_bounds` function that defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c). The `do_bounds` handler starts with the call of the `exception_enter` function and ends with the call of the `exception_exit`:
```C
prev_state = exception_enter();
@ -457,7 +457,7 @@ Links
* [General Protection Fault](https://en.wikipedia.org/wiki/General_protection_fault)
* [opcode](https://en.wikipedia.org/wiki/Opcode)
* [Non-Masksable](https://en.wikipedia.org/wiki/Non-maskable_interrupt)
* [Non-Maskable](https://en.wikipedia.org/wiki/Non-maskable_interrupt)
* [BOUND instruction](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm)
* [CPU socket](https://en.wikipedia.org/wiki/CPU_socket)
* [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table)

Loading…
Cancel
Save