1
0
mirror of https://github.com/0xAX/linux-insides.git synced 2024-12-22 22:58:08 +00:00

Merge branch 'master' into linux-interrupts-2-update

This commit is contained in:
0xAX 2020-06-14 00:05:59 +06:00 committed by GitHub
commit 58928da0bb
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
55 changed files with 356 additions and 356 deletions

View File

@ -164,7 +164,7 @@ just as explained above. We have only 16-bit general purpose registers, which ha
'0x10ffef' '0x10ffef'
``` ```
where `0x10ffef` is equal to `1MB + 64KB - 16b`. An [8086](https://en.wikipedia.org/wiki/Intel_8086) processor (which was the first processor with real mode), in contrast, has a 20-bit address line. Since `2^20 = 1048576` is 1MB, this means that the actual available memory is 1MB. where `0x10ffef` is equal to `(1MB + 64KB - 16B) - 1`. An [8086](https://en.wikipedia.org/wiki/Intel_8086) processor (which was the first processor with real mode), in contrast, has a 20-bit address line. Since `2^20 = 1048576` is 1MB and `2^20 - 1` is the maximum address that could be used, this means that the actual available memory is 1MB.
In general, real mode's memory map is as follows: In general, real mode's memory map is as follows:

View File

@ -137,7 +137,7 @@ Now that we have our bearings, let's look at the contents of the `startup_32` fu
In the beginning of the `startup_32` function, we can see the `cld` instruction which clears the `DF` bit in the [flags](https://en.wikipedia.org/wiki/FLAGS_register) register. When the direction flag is clear, all string operations like [stos](http://x86.renejeschke.de/html/file_module_x86_id_306.html), [scas](http://x86.renejeschke.de/html/file_module_x86_id_287.html) and others will increment the index registers `esi` or `edi`. We need to clear the direction flag because later we will use strings operations to perform various operations such as clearing space for page tables. In the beginning of the `startup_32` function, we can see the `cld` instruction which clears the `DF` bit in the [flags](https://en.wikipedia.org/wiki/FLAGS_register) register. When the direction flag is clear, all string operations like [stos](http://x86.renejeschke.de/html/file_module_x86_id_306.html), [scas](http://x86.renejeschke.de/html/file_module_x86_id_287.html) and others will increment the index registers `esi` or `edi`. We need to clear the direction flag because later we will use strings operations to perform various operations such as clearing space for page tables.
After we have cleared the `DF` bit, the next step is to check the `KEEP_SEGMENTS` flag in the `loadflags` kernel setup header field. If you remember, we already talked about `loadflags` in the very first [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-1.html) of this book. There we checked the `CAN_USE_HEAP` flag to query the ability to use the heap. Now we need to check the `KEEP_SEGMENTS` flag. This flag is described in the linux [boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt) documentation: After we have cleared the `DF` bit, the next step is to check the `KEEP_SEGMENTS` flag in the `loadflags` kernel setup header field. If you remember, we already talked about `loadflags` in the very first [part](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-1) of this book. There we checked the `CAN_USE_HEAP` flag to query the ability to use the heap. Now we need to check the `KEEP_SEGMENTS` flag. This flag is described in the linux [boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt) documentation:
``` ```
Bit 6 (write): KEEP_SEGMENTS Bit 6 (write): KEEP_SEGMENTS

View File

@ -4,7 +4,7 @@ Control Groups
Introduction Introduction
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the first part of the new chapter of the [linux insides](https://0xax.gitbooks.io/linux-insides/content/) book and as you may guess by part's name - this part will cover [control groups](https://en.wikipedia.org/wiki/Cgroups) or `cgroups` mechanism in the Linux kernel. This is the first part of the new chapter of the [linux insides](https://github.com/0xAX/linux-insides/blob/master/SUMMARY.md) book and as you may guess by part's name - this part will cover [control groups](https://en.wikipedia.org/wiki/Cgroups) or `cgroups` mechanism in the Linux kernel.
`Cgroups` are special mechanism provided by the Linux kernel which allows us to allocate kind of `resources` like processor time, number of processes per group, amount of memory per control group or combination of such resources for a process or set of processes. `Cgroups` are organized hierarchically and here this mechanism is similar to usual processes as they are hierarchical too and child `cgroups` inherit set of certain parameters from their parents. But actually they are not the same. The main differences between `cgroups` and normal processes that many different hierarchies of control groups may exist simultaneously in one time while normal process tree is always single. This was not a casual step because each control group hierarchy is attached to set of control group `subsystems`. `Cgroups` are special mechanism provided by the Linux kernel which allows us to allocate kind of `resources` like processor time, number of processes per group, amount of memory per control group or combination of such resources for a process or set of processes. `Cgroups` are organized hierarchically and here this mechanism is similar to usual processes as they are hierarchical too and child `cgroups` inherit set of certain parameters from their parents. But actually they are not the same. The main differences between `cgroups` and normal processes that many different hierarchies of control groups may exist simultaneously in one time while normal process tree is always single. This was not a casual step because each control group hierarchy is attached to set of control group `subsystems`.
@ -448,4 +448,4 @@ Links
* [bash](https://www.gnu.org/software/bash/) * [bash](https://www.gnu.org/software/bash/)
* [docker](https://en.wikipedia.org/wiki/Docker_\(software\)) * [docker](https://en.wikipedia.org/wiki/Docker_\(software\))
* [perf events](https://en.wikipedia.org/wiki/Perf_\(Linux\)) * [perf events](https://en.wikipedia.org/wiki/Perf_\(Linux\))
* [Previous chapter](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-1.html) * [Previous chapter](https://0xax.gitbook.io/linux-insides/summary/mm/linux-mm-1)

View File

@ -92,7 +92,7 @@ Where the `percpu_alloc_setup` function sets the `pcpu_chosen_fc` variable depen
enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO; enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO;
``` ```
If the `percpu_alloc` parameter is not given to the kernel command line, the `embed` allocator will be used which embeds the first percpu chunk into bootmem with the [memblock](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-1.html). The last allocator is the first chunk `page` allocator which maps the first chunk with `PAGE_SIZE` pages. If the `percpu_alloc` parameter is not given to the kernel command line, the `embed` allocator will be used which embeds the first percpu chunk into bootmem with the [memblock](https://0xax.gitbook.io/linux-insides/summary/mm/linux-mm-1). The last allocator is the first chunk `page` allocator which maps the first chunk with `PAGE_SIZE` pages.
As I wrote above, first of all we make a check of the first chunk allocator type in the `setup_per_cpu_areas`. We check that first chunk allocator is not page: As I wrote above, first of all we make a check of the first chunk allocator type in the `setup_per_cpu_areas`. We check that first chunk allocator is not page:

View File

@ -10,7 +10,7 @@ Introduction
* [lib/cpumask.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/lib/cpumask.c) * [lib/cpumask.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/lib/cpumask.c)
* [kernel/cpu.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/cpu.c) * [kernel/cpu.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/cpu.c)
As comment says from the [include/linux/cpumask.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/cpumask.h): Cpumasks provide a bitmap suitable for representing the set of CPU's in a system, one bit position per CPU number. We already saw a bit about cpumask in the `boot_cpu_init` function from the [Kernel entry point](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) part. This function makes first boot cpu online, active and etc...: As comment says from the [include/linux/cpumask.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/cpumask.h): Cpumasks provide a bitmap suitable for representing the set of CPU's in a system, one bit position per CPU number. We already saw a bit about cpumask in the `boot_cpu_init` function from the [Kernel entry point](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-4) part. This function makes first boot cpu online, active and etc...:
```C ```C
set_cpu_online(cpu, true); set_cpu_online(cpu, true);

View File

@ -224,7 +224,7 @@ If you are interested, you can find these sections in the `arch/x86/kernel/vmlin
} }
``` ```
If you are not familiar with this then you can know more about [linkers](https://en.wikipedia.org/wiki/Linker_%28computing%29) in the special [part](https://0xax.gitbooks.io/linux-insides/content/Misc/linux-misc-3.html) of this book. If you are not familiar with this then you can know more about [linkers](https://en.wikipedia.org/wiki/Linker_%28computing%29) in the special [part](https://0xax.gitbook.io/linux-insides/summary/misc/linux-misc-3) of this book.
As we just saw, the `do_initcall_level` function takes one parameter - level of `initcall` - and does the following two things: As we just saw, the `do_initcall_level` function takes one parameter - level of `initcall` - and does the following two things:
@ -415,9 +415,9 @@ Links
* [symbols concatenation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html) * [symbols concatenation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html)
* [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) * [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
* [Link time optimization](https://gcc.gnu.org/wiki/LinkTimeOptimization) * [Link time optimization](https://gcc.gnu.org/wiki/LinkTimeOptimization)
* [Introduction to linkers](https://0xax.gitbooks.io/linux-insides/content/Misc/linux-misc-3.html) * [Introduction to linkers](https://0xax.gitbook.io/linux-insides/summary/misc/linux-misc-3)
* [Linux kernel command line](https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/kernel-parameters.rst) * [Linux kernel command line](https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/kernel-parameters.rst)
* [Process identifier](https://en.wikipedia.org/wiki/Process_identifier) * [Process identifier](https://en.wikipedia.org/wiki/Process_identifier)
* [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) * [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [rootfs](https://en.wikipedia.org/wiki/Initramfs) * [rootfs](https://en.wikipedia.org/wiki/Initramfs)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) * [previous part](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2)

View File

@ -76,7 +76,7 @@ In the first case for the `blocking notifier chains`, callbacks will be called/e
The second `SRCU notifier chains` represent alternative form of `blocking notifier chains`. In the first case, blocking notifier chains uses `rw_semaphore` synchronization primitive to protect chain links. `SRCU` notifier chains run in process context too, but uses special form of [RCU](https://en.wikipedia.org/wiki/Read-copy-update) mechanism which is permissible to block in an read-side critical section. The second `SRCU notifier chains` represent alternative form of `blocking notifier chains`. In the first case, blocking notifier chains uses `rw_semaphore` synchronization primitive to protect chain links. `SRCU` notifier chains run in process context too, but uses special form of [RCU](https://en.wikipedia.org/wiki/Read-copy-update) mechanism which is permissible to block in an read-side critical section.
In the third case for the `atomic notifier chains` runs in interrupt or atomic context and protected by [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) synchronization primitive. The last `raw notifier chains` provides special type of notifier chains without any locking restrictions on callbacks. This means that protection rests on the shoulders of caller side. It is very useful when we want to protect our chain with very specific locking mechanism. In the third case for the `atomic notifier chains` runs in interrupt or atomic context and protected by [spinlock](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-1) synchronization primitive. The last `raw notifier chains` provides special type of notifier chains without any locking restrictions on callbacks. This means that protection rests on the shoulders of caller side. It is very useful when we want to protect our chain with very specific locking mechanism.
If we will look at the implementation of the `notifier_block` structure, we will see that it contains pointer to the `next` element from a notification chain list, but we have no head. Actually a head of such list is in separate structure depends on type of a notification chain. For example for the `blocking notifier chains`: If we will look at the implementation of the `notifier_block` structure, we will see that it contains pointer to the `next` element from a notification chain list, but we have no head. Actually a head of such list is in separate structure depends on type of a notification chain. For example for the `blocking notifier chains`:
@ -118,7 +118,7 @@ which defines head for loadable modules blocking notifier chain. The `BLOCKING_N
} while (0) } while (0)
``` ```
So we may see that it takes name of a name of a head of a blocking notifier chain and initializes read/write [semaphore](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-3.html) and set head to `NULL`. Besides the `BLOCKING_INIT_NOTIFIER_HEAD` macro, the Linux kernel additionally provides `ATOMIC_INIT_NOTIFIER_HEAD`, `RAW_INIT_NOTIFIER_HEAD` macros and `srcu_init_notifier` function for initialization atomic and other types of notification chains. So we may see that it takes name of a name of a head of a blocking notifier chain and initializes read/write [semaphore](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-3) and set head to `NULL`. Besides the `BLOCKING_INIT_NOTIFIER_HEAD` macro, the Linux kernel additionally provides `ATOMIC_INIT_NOTIFIER_HEAD`, `RAW_INIT_NOTIFIER_HEAD` macros and `srcu_init_notifier` function for initialization atomic and other types of notification chains.
After initialization of a head of a notification chain, a subsystem which wants to receive notification from the given notification chain it should register with certain function which is depends on type of notification. If you will look in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) header file, you will see following four function for this: After initialization of a head of a notification chain, a subsystem which wants to receive notification from the given notification chain it should register with certain function which is depends on type of notification. If you will look in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) header file, you will see following four function for this:
@ -331,7 +331,7 @@ static struct notifier_block tracepoint_module_nb = {
}; };
``` ```
When one of the `MODULE_STATE_LIVE`, `MODULE_STATE_COMING` or `MODULE_STATE_GOING` events occurred. For example the `MODULE_STATE_LIVE` the `MODULE_STATE_COMING` notifications will be sent during execution of the [init_module](http://man7.org/linux/man-pages/man2/init_module.2.html) [system call](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-1.html). Or for example `MODULE_STATE_GOING` will be sent during execution of the [delete_module](http://man7.org/linux/man-pages/man2/delete_module.2.html) `system call`: When one of the `MODULE_STATE_LIVE`, `MODULE_STATE_COMING` or `MODULE_STATE_GOING` events occurred. For example the `MODULE_STATE_LIVE` the `MODULE_STATE_COMING` notifications will be sent during execution of the [init_module](http://man7.org/linux/man-pages/man2/init_module.2.html) [system call](https://0xax.gitbook.io/linux-insides/summary/syscall/linux-syscall-1). Or for example `MODULE_STATE_GOING` will be sent during execution of the [delete_module](http://man7.org/linux/man-pages/man2/delete_module.2.html) `system call`:
```C ```C
SYSCALL_DEFINE2(delete_module, const char __user *, name_user, SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
@ -359,11 +359,11 @@ Links
* [API](https://en.wikipedia.org/wiki/Application_programming_interface) * [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [callback](https://en.wikipedia.org/wiki/Callback_%28computer_programming%29) * [callback](https://en.wikipedia.org/wiki/Callback_%28computer_programming%29)
* [RCU](https://en.wikipedia.org/wiki/Read-copy-update) * [RCU](https://en.wikipedia.org/wiki/Read-copy-update)
* [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) * [spinlock](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-1)
* [loadable modules](https://en.wikipedia.org/wiki/Loadable_kernel_module) * [loadable modules](https://en.wikipedia.org/wiki/Loadable_kernel_module)
* [semaphore](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-3.html) * [semaphore](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-3)
* [tracepoints](https://www.kernel.org/doc/Documentation/trace/tracepoints.txt) * [tracepoints](https://www.kernel.org/doc/Documentation/trace/tracepoints.txt)
* [system call](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-1.html) * [system call](https://0xax.gitbook.io/linux-insides/summary/syscall/linux-syscall-1)
* [init_module system call](http://man7.org/linux/man-pages/man2/init_module.2.html) * [init_module system call](http://man7.org/linux/man-pages/man2/init_module.2.html)
* [delete_module](http://man7.org/linux/man-pages/man2/delete_module.2.html) * [delete_module](http://man7.org/linux/man-pages/man2/delete_module.2.html)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-3.html) * [previous part](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-3)

View File

@ -13,7 +13,7 @@ Besides these two files, there is also architecture-specific header file which p
* [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/bitops.h) * [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/bitops.h)
header file. As I just wrote above, the `bitmap` is heavily used in the Linux kernel. For example a `bit array` is used to store set of online/offline processors for systems which support [hot-plug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) cpu (more about this you can read in the [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) part), a `bit array` stores set of allocated [irqs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) during initialization of the Linux kernel and etc. header file. As I just wrote above, the `bitmap` is heavily used in the Linux kernel. For example a `bit array` is used to store set of online/offline processors for systems which support [hot-plug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) cpu (more about this you can read in the [cpumasks](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2) part), a `bit array` stores set of allocated [irqs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) during initialization of the Linux kernel and etc.
So, the main goal of this part is to see how `bit arrays` are implemented in the Linux kernel. Let's start. So, the main goal of this part is to see how `bit arrays` are implemented in the Linux kernel. Let's start.
@ -365,7 +365,7 @@ Links
* [linked data structures](https://en.wikipedia.org/wiki/Linked_data_structure) * [linked data structures](https://en.wikipedia.org/wiki/Linked_data_structure)
* [tree data structures](https://en.wikipedia.org/wiki/Tree_%28data_structure%29) * [tree data structures](https://en.wikipedia.org/wiki/Tree_%28data_structure%29)
* [hot-plug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) * [hot-plug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
* [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) * [cpumasks](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2)
* [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) * [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface) * [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [atomic operations](https://en.wikipedia.org/wiki/Linearizability) * [atomic operations](https://en.wikipedia.org/wiki/Linearizability)

View File

@ -4,9 +4,9 @@ Kernel initialization. Part 1.
First steps in the kernel code First steps in the kernel code
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
The previous [post](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-6.html) was a last part of the Linux kernel [booting process](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html) chapter and now we are starting to dive into initialization process of the Linux kernel. After the image of the Linux kernel is decompressed and placed in a correct place in memory, it starts to work. All previous parts describe the work of the Linux kernel setup code which does preparation before the first bytes of the Linux kernel code will be executed. From now we are in the kernel and all parts of this chapter will be devoted to the initialization process of the kernel before it will launch process with [pid](https://en.wikipedia.org/wiki/Process_identifier) `1`. There are many things to do before the kernel will start first `init` process. Hope we will see all of the preparations before kernel will start in this big chapter. We will start from the kernel entry point, which is located in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) and will move further and further. We will see first preparations like early page tables initialization, switch to a new descriptor in kernel space and many many more, before we will see the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) will be called. The previous [post](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-6) was a last part of the Linux kernel [booting process](https://0xax.gitbook.io/linux-insides/summary/booting) chapter and now we are starting to dive into initialization process of the Linux kernel. After the image of the Linux kernel is decompressed and placed in a correct place in memory, it starts to work. All previous parts describe the work of the Linux kernel setup code which does preparation before the first bytes of the Linux kernel code will be executed. From now we are in the kernel and all parts of this chapter will be devoted to the initialization process of the kernel before it will launch process with [pid](https://en.wikipedia.org/wiki/Process_identifier) `1`. There are many things to do before the kernel will start first `init` process. Hope we will see all of the preparations before kernel will start in this big chapter. We will start from the kernel entry point, which is located in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) and will move further and further. We will see first preparations like early page tables initialization, switch to a new descriptor in kernel space and many many more, before we will see the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) will be called.
In the last [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-6.html) of the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html) we stopped at the jmp instruction from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly source code file: In the last [part](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-6) of the previous [chapter](https://0xax.gitbook.io/linux-insides/summary/booting) we stopped at the jmp instruction from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly source code file:
```assembly ```assembly
jmp *%rax jmp *%rax
@ -110,7 +110,7 @@ After we got the delta, we check if `_text` address is correctly aligned for `2`
for (;;); for (;;);
``` ```
If `_text` address is not aligned for `2` megabytes, we enter infinite loop. The `PMD_PAGE_MASK` indicates the mask for `Page middle directory` (read [Paging](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html) about it) and is defined as: If `_text` address is not aligned for `2` megabytes, we enter infinite loop. The `PMD_PAGE_MASK` indicates the mask for `Page middle directory` (read [Paging](https://0xax.gitbook.io/linux-insides/summary/theory/linux-theory-1) about it) and is defined as:
```C ```C
#define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1)) #define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1))
@ -194,7 +194,7 @@ The `level2_kernel_pgt` is page table entry which contains pointer to the page m
The `level2_fixmap_pgt` is a virtual addresses which can refer to any physical addresses even under kernel space. They are represented by the `4048` bytes of zeros, the `level1_fixmap_pgt` entry, `8` megabytes reserved for [vsyscalls](https://lwn.net/Articles/446528/) mapping and `2` megabytes of hole. The `level2_fixmap_pgt` is a virtual addresses which can refer to any physical addresses even under kernel space. They are represented by the `4048` bytes of zeros, the `level1_fixmap_pgt` entry, `8` megabytes reserved for [vsyscalls](https://lwn.net/Articles/446528/) mapping and `2` megabytes of hole.
You can read more about it in the [Paging](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html) part. You can read more about it in the [Paging](https://0xax.gitbook.io/linux-insides/summary/theory/linux-theory-1) part.
Now, after we saw the definitions of these symbols, let's get back to the code. Next we initialize last entry of `pgd` with `level3_kernel_pgt`: Now, after we saw the definitions of these symbols, let's get back to the code. Next we initialize last entry of `pgd` with `level3_kernel_pgt`:
@ -549,7 +549,7 @@ INIT_PER_CPU(gdt_page);
As we got `init_per_cpu__gdt_page` in `INIT_PER_CPU_VAR` and `INIT_PER_CPU` macro from linker script will be expanded we will get offset from the `__per_cpu_load`. After this calculations, we will have correct base address of the new GDT. As we got `init_per_cpu__gdt_page` in `INIT_PER_CPU_VAR` and `INIT_PER_CPU` macro from linker script will be expanded we will get offset from the `__per_cpu_load`. After this calculations, we will have correct base address of the new GDT.
Generally per-CPU variables is a 2.6 kernel feature. You can understand what it is from its name. When we create `per-CPU` variable, each CPU will have its own copy of this variable. Here we are creating `gdt_page` per-CPU variable. There are many advantages for variables of this type, like there are no locks, because each CPU works with its own copy of variable and etc... So every core on multiprocessor will have its own `GDT` table and every entry in the table will represent a memory segment which can be accessed from the thread which ran on the core. You can read in details about `per-CPU` variables in the [Concepts/per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) post. Generally per-CPU variables is a 2.6 kernel feature. You can understand what it is from its name. When we create `per-CPU` variable, each CPU will have its own copy of this variable. Here we are creating `gdt_page` per-CPU variable. There are many advantages for variables of this type, like there are no locks, because each CPU works with its own copy of variable and etc... So every core on multiprocessor will have its own `GDT` table and every entry in the table will represent a memory segment which can be accessed from the thread which ran on the core. You can read in details about `per-CPU` variables in the [Concepts/per-cpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1) post.
As we loaded new Global Descriptor Table, we reload segments as we did it every time: As we loaded new Global Descriptor Table, we reload segments as we did it every time:
@ -683,7 +683,7 @@ Links
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
* [Model Specific Register](http://en.wikipedia.org/wiki/Model-specific_register) * [Model Specific Register](http://en.wikipedia.org/wiki/Model-specific_register)
* [Paging](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html) * [Paging](https://0xax.gitbook.io/linux-insides/summary/theory/linux-theory-1)
* [Previous part - kernel load address randomization](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-6.html) * [Previous part - kernel load address randomization](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-6)
* [NX](http://en.wikipedia.org/wiki/NX_bit) * [NX](http://en.wikipedia.org/wiki/NX_bit)
* [ASLR](http://en.wikipedia.org/wiki/Address_space_layout_randomization) * [ASLR](http://en.wikipedia.org/wiki/Address_space_layout_randomization)

View File

@ -4,7 +4,7 @@ Kernel initialization. Part 10.
End of the linux kernel initialization process End of the linux kernel initialization process
================================================================================ ================================================================================
This is tenth part of the chapter about linux kernel [initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and in the [previous part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html) we saw the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update) and stopped on the call of the `acpi_early_init` function. This part will be the last part of the [Kernel initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) chapter, so let's finish it. This is tenth part of the chapter about linux kernel [initialization process](https://0xax.gitbook.io/linux-insides/summary/initialization) and in the [previous part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-9) we saw the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update) and stopped on the call of the `acpi_early_init` function. This part will be the last part of the [Kernel initialization process](https://0xax.gitbook.io/linux-insides/summary/initialization) chapter, so let's finish it.
After the call of the `acpi_early_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c), we can see the following code: After the call of the `acpi_early_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c), we can see the following code:
@ -185,7 +185,7 @@ nrpages = (nr_free_buffer_pages() * 10) / 100;
max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head)); max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
``` ```
which will be equal to the `10%` of the `ZONE_NORMAL` (all RAM from the 4GB on the `x86_64`). The next function after the `buffer_init` is - `vfs_caches_init`. This function allocates `SLAB` caches and hashtable for different [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) caches. We already saw the `vfs_caches_init_early` function in the eighth part of the linux kernel [initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html) which initialized caches for `dcache` (or directory-cache) and [inode](http://en.wikipedia.org/wiki/Inode) cache. The `vfs_caches_init` function makes post-early initialization of the `dcache` and `inode` caches, private data cache, hash tables for the mount points, etc. More details about [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) will be described in the separate part. After this we can see `signals_init` function. This function is defined in the [kernel/signal.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/signal.c) and allocates a cache for the `sigqueue` structures which represents queue of the real time signals. The next function is `page_writeback_init`. This function initializes the ratio for the dirty pages. Every low-level page entry contains the `dirty` bit which indicates whether a page has been written to after been loaded into memory. which will be equal to the `10%` of the `ZONE_NORMAL` (all RAM from the 4GB on the `x86_64`). The next function after the `buffer_init` is - `vfs_caches_init`. This function allocates `SLAB` caches and hashtable for different [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) caches. We already saw the `vfs_caches_init_early` function in the eighth part of the linux kernel [initialization process](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-8) which initialized caches for `dcache` (or directory-cache) and [inode](http://en.wikipedia.org/wiki/Inode) cache. The `vfs_caches_init` function makes post-early initialization of the `dcache` and `inode` caches, private data cache, hash tables for the mount points, etc. More details about [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) will be described in the separate part. After this we can see `signals_init` function. This function is defined in the [kernel/signal.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/signal.c) and allocates a cache for the `sigqueue` structures which represents queue of the real time signals. The next function is `page_writeback_init`. This function initializes the ratio for the dirty pages. Every low-level page entry contains the `dirty` bit which indicates whether a page has been written to after been loaded into memory.
Creation of the root for the procfs Creation of the root for the procfs
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
@ -440,7 +440,7 @@ That's all! Linux kernel initialization process is finished!
Conclusion Conclusion
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
It is the end of the tenth part about the linux kernel [initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). It is not only the `tenth` part, but also is the last part which describes initialization of the linux kernel. As I wrote in the first [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of this chapter, we will go through all steps of the kernel initialization and we did it. We started at the first architecture-independent function - `start_kernel` and finished with the launch of the first `init` process in the our system. I skipped details about different subsystem of the kernel, for example I almost did not cover scheduler, interrupts, exception handling, etc. From the next part we will start to dive to the different kernel subsystems. Hope it will be interesting. It is the end of the tenth part about the linux kernel [initialization process](https://0xax.gitbook.io/linux-insides/summary/initialization). It is not only the `tenth` part, but also is the last part which describes initialization of the linux kernel. As I wrote in the first [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-1) of this chapter, we will go through all steps of the kernel initialization and we did it. We started at the first architecture-independent function - `start_kernel` and finished with the launch of the first `init` process in the our system. I skipped details about different subsystem of the kernel, for example I almost did not cover scheduler, interrupts, exception handling, etc. From the next part we will start to dive to the different kernel subsystems. Hope it will be interesting.
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
@ -470,4 +470,4 @@ Links
* [Tmpfs](http://en.wikipedia.org/wiki/Tmpfs) * [Tmpfs](http://en.wikipedia.org/wiki/Tmpfs)
* [initrd](http://en.wikipedia.org/wiki/Initrd) * [initrd](http://en.wikipedia.org/wiki/Initrd)
* [panic](http://en.wikipedia.org/wiki/Kernel_panic) * [panic](http://en.wikipedia.org/wiki/Kernel_panic)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-9)

View File

@ -4,9 +4,9 @@ Kernel initialization. Part 2.
Early interrupt and exception handling Early interrupt and exception handling
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) we stopped before setting of early interrupt handlers. At this moment we are in the decompressed Linux kernel, we have basic [paging](https://en.wikipedia.org/wiki/Page_table) structure for early boot and our current goal is to finish early preparation before the main kernel code will start to work. In the previous [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-1) we stopped before setting of early interrupt handlers. At this moment we are in the decompressed Linux kernel, we have basic [paging](https://en.wikipedia.org/wiki/Page_table) structure for early boot and our current goal is to finish early preparation before the main kernel code will start to work.
We already started to do this preparation in the previous [first](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) part of this [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). We continue in this part and will know more about interrupt and exception handling. We already started to do this preparation in the previous [first](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-1) part of this [chapter](https://0xax.gitbook.io/linux-insides/summary/initialization). We continue in this part and will know more about interrupt and exception handling.
Remember that we stopped before following function: Remember that we stopped before following function:
@ -555,4 +555,4 @@ Links
* [Page table](https://en.wikipedia.org/wiki/Page_table) * [Page table](https://en.wikipedia.org/wiki/Page_table)
* [Interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler) * [Interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler)
* [Page Fault](https://en.wikipedia.org/wiki/Page_fault), * [Page Fault](https://en.wikipedia.org/wiki/Page_fault),
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-1)

View File

@ -76,7 +76,7 @@ After microcode was loaded we can see the check of the `console_loglevel` and th
Move on init pages Move on init pages
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
In the next step, as we have copied `boot_params` structure, we need to move from the early page tables to the page tables for initialization process. We already set early page tables for switchover, you can read about it in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) and dropped all it in the `reset_early_page_tables` function (you can read about it in the previous part too) and kept only kernel high mapping. After this we call: In the next step, as we have copied `boot_params` structure, we need to move from the early page tables to the page tables for initialization process. We already set early page tables for switchover, you can read about it in the previous [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-1) and dropped all it in the `reset_early_page_tables` function (you can read about it in the previous part too) and kept only kernel high mapping. After this we call:
```C ```C
clear_page(init_level4_pgt); clear_page(init_level4_pgt);

View File

@ -241,7 +241,7 @@ For now it is just zero. If the `CONFIG_DEBUG_PREEMPT` configuration option is d
#define raw_smp_processor_id() (this_cpu_read(cpu_number)) #define raw_smp_processor_id() (this_cpu_read(cpu_number))
``` ```
`this_cpu_read` as many other function like this (`this_cpu_write`, `this_cpu_add` and etc...) defined in the [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/percpu-defs.h) and presents `this_cpu` operation. These operations provide a way of optimizing access to the [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variables which are associated with the current processor. In our case it is `this_cpu_read`: `this_cpu_read` as many other function like this (`this_cpu_write`, `this_cpu_add` and etc...) defined in the [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/percpu-defs.h) and presents `this_cpu` operation. These operations provide a way of optimizing access to the [per-cpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1) variables which are associated with the current processor. In our case it is `this_cpu_read`:
``` ```
__pcpu_size_call_return(this_cpu_read_, pcp) __pcpu_size_call_return(this_cpu_read_, pcp)
@ -346,7 +346,7 @@ static inline int __check_is_bitmap(const unsigned long *bitmap)
Yeah, it just returns `1` every time. Actually we need in it here only for one purpose: at compile time it checks that the given `bitmap` is a bitmap, or in other words it checks that the given `bitmap` has a type of `unsigned long *`. So we just pass `cpu_possible_bits` to the `to_cpumask` macro for converting the array of `unsigned long` to the `struct cpumask *`. Now we can call `cpumask_set_cpu` function with the `cpu` - 0 and `struct cpumask *cpu_possible_bits`. This function makes only one call of the `set_bit` function which sets the given `cpu` in the cpumask. All of these `set_cpu_*` functions work on the same principle. Yeah, it just returns `1` every time. Actually we need in it here only for one purpose: at compile time it checks that the given `bitmap` is a bitmap, or in other words it checks that the given `bitmap` has a type of `unsigned long *`. So we just pass `cpu_possible_bits` to the `to_cpumask` macro for converting the array of `unsigned long` to the `struct cpumask *`. Now we can call `cpumask_set_cpu` function with the `cpu` - 0 and `struct cpumask *cpu_possible_bits`. This function makes only one call of the `set_bit` function which sets the given `cpu` in the cpumask. All of these `set_cpu_*` functions work on the same principle.
If you're not sure that this `set_cpu_*` operations and `cpumask` are not clear for you, don't worry about it. You can get more info by reading the special part about it - [cpumask](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) or [documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt). If you're not sure that this `set_cpu_*` operations and `cpumask` are not clear for you, don't worry about it. You can get more info by reading the special part about it - [cpumask](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2) or [documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt).
As we activated the bootstrap processor, it's time to go to the next function in the `start_kernel.` Now it is `page_address_init`, but this function does nothing in our case, because it executes only when all `RAM` can't be mapped directly. As we activated the bootstrap processor, it's time to go to the next function in the `start_kernel.` Now it is `page_address_init`, but this function does nothing in our case, because it executes only when all `RAM` can't be mapped directly.
@ -383,7 +383,7 @@ This function starts from the reserving memory block for the kernel `_text` and
memblock_reserve(__pa_symbol(_text), (unsigned long)__bss_stop - (unsigned long)_text); memblock_reserve(__pa_symbol(_text), (unsigned long)__bss_stop - (unsigned long)_text);
``` ```
You can read about `memblock` in the [Linux kernel memory management Part 1.](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-1.html). As you can remember `memblock_reserve` function takes two parameters: You can read about `memblock` in the [Linux kernel memory management Part 1.](https://0xax.gitbook.io/linux-insides/summary/mm/linux-mm-1). As you can remember `memblock_reserve` function takes two parameters:
* base physical address of a memory block; * base physical address of a memory block;
* size of a memory block. * size of a memory block.
@ -415,7 +415,7 @@ u64 ramdisk_size = get_ramdisk_size();
u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size); u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);
``` ```
All of these parameters are taken from `boot_params`. If you have read the chapter about [Linux Kernel Booting Process](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html), you must remember that we filled the `boot_params` structure during boot time. The kernel setup header contains a couple of fields which describes ramdisk, for example: All of these parameters are taken from `boot_params`. If you have read the chapter about [Linux Kernel Booting Process](https://0xax.gitbook.io/linux-insides/summary/booting), you must remember that we filled the `boot_params` structure during boot time. The kernel setup header contains a couple of fields which describes ramdisk, for example:
``` ```
Field name: ramdisk_image Field name: ramdisk_image

View File

@ -4,7 +4,7 @@ Kernel initialization. Part 5.
Continue of architecture-specific initialization Continue of architecture-specific initialization
================================================================================ ================================================================================
In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html), we stopped at the initialization of an architecture-specific stuff from the [setup_arch](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c#L856) function and now we will continue with it. As we reserved memory for the [initrd](http://en.wikipedia.org/wiki/Initrd), next step is the `olpc_ofw_detect` which detects [One Laptop Per Child support](http://wiki.laptop.org/go/OFW_FAQ). We will not consider platform related stuff in this book and will skip functions related with it. So let's go ahead. The next step is the `early_trap_init` function. This function initializes debug (`#DB` - raised when the `TF` flag of rflags is set) and `int3` (`#BP`) interrupts gate. If you don't know anything about interrupts, you can read about it in the [Early interrupt and exception handling](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html). In `x86` architecture `INT`, `INTO` and `INT3` are special instructions which allow a task to explicitly call an interrupt handler. The `INT3` instruction calls the breakpoint (`#BP`) handler. You may remember, we already saw it in the [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) about interrupts: and exceptions: In the previous [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-4), we stopped at the initialization of an architecture-specific stuff from the [setup_arch](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c#L856) function and now we will continue with it. As we reserved memory for the [initrd](http://en.wikipedia.org/wiki/Initrd), next step is the `olpc_ofw_detect` which detects [One Laptop Per Child support](http://wiki.laptop.org/go/OFW_FAQ). We will not consider platform related stuff in this book and will skip functions related with it. So let's go ahead. The next step is the `early_trap_init` function. This function initializes debug (`#DB` - raised when the `TF` flag of rflags is set) and `int3` (`#BP`) interrupts gate. If you don't know anything about interrupts, you can read about it in the [Early interrupt and exception handling](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-2). In `x86` architecture `INT`, `INTO` and `INT3` are special instructions which allow a task to explicitly call an interrupt handler. The `INT3` instruction calls the breakpoint (`#BP`) handler. You may remember, we already saw it in the [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-2) about interrupts: and exceptions:
``` ```
---------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------
@ -163,7 +163,7 @@ The next step is initialization of early `ioremap`. In general there are two way
* I/O Ports; * I/O Ports;
* Device memory. * Device memory.
We already saw first method (`outb/inb` instructions) in the part about linux kernel booting [process](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html). The second method is to map I/O physical addresses to virtual addresses. When a physical address is accessed by the CPU, it may refer to a portion of physical RAM which can be mapped on memory of the I/O device. So `ioremap` used to map device memory into kernel address space. We already saw first method (`outb/inb` instructions) in the part about linux kernel booting [process](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-3). The second method is to map I/O physical addresses to virtual addresses. When a physical address is accessed by the CPU, it may refer to a portion of physical RAM which can be mapped on memory of the I/O device. So `ioremap` used to map device memory into kernel address space.
As i wrote above next function is the `early_ioremap_init` which re-maps I/O memory to kernel address space so it can access it. We need to initialize early ioremap for early initialization code which needs to temporarily map I/O or memory regions before the normal mapping functions like `ioremap` are available. Implementation of this function is in the [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/mm/ioremap.c). At the start of the `early_ioremap_init` we can see definition of the `pmd` pointer with `pmd_t` type (which presents page middle directory entry `typedef struct { pmdval_t pmd; } pmd_t;` where `pmdval_t` is `unsigned long`) and make a check that `fixmap` aligned in a correct way: As i wrote above next function is the `early_ioremap_init` which re-maps I/O memory to kernel address space so it can access it. We need to initialize early ioremap for early initialization code which needs to temporarily map I/O or memory regions before the normal mapping functions like `ioremap` are available. Implementation of this function is in the [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/mm/ioremap.c). At the start of the `early_ioremap_init` we can see definition of the `pmd` pointer with `pmd_t` type (which presents page middle directory entry `typedef struct { pmdval_t pmd; } pmd_t;` where `pmdval_t` is `unsigned long`) and make a check that `fixmap` aligned in a correct way:
@ -235,7 +235,7 @@ After calculation we will get `0xfff` or 12 bits for `major` if it is `0xfffffff
Memory map setup Memory map setup
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
The next point is the setup of the memory map with the call of the `setup_memory_map` function. But before this we setup different parameters as information about a screen (current row and column, video page and etc... (you can read about it in the [Video mode initialization and transition to protected mode](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html))), Extended display identification data, video mode, bootloader_type and etc...: The next point is the setup of the memory map with the call of the `setup_memory_map` function. But before this we setup different parameters as information about a screen (current row and column, video page and etc... (you can read about it in the [Video mode initialization and transition to protected mode](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-3))), Extended display identification data, video mode, bootloader_type and etc...:
```C ```C
screen_info = boot_params.screen_info; screen_info = boot_params.screen_info;
@ -354,7 +354,7 @@ struct x86_init_ops x86_init __initdata = {
} }
``` ```
As we can see here `memory_setup` field is `default_machine_specific_memory_setup` where we get the number of the [e820](http://en.wikipedia.org/wiki/E820) entries which we collected in the [boot time](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html), sanitize the BIOS e820 map and fill `e820map` structure with the memory regions. As all regions are collected, print of all regions with printk. You can find this print if you execute `dmesg` command and you can see something like this: As we can see here `memory_setup` field is `default_machine_specific_memory_setup` where we get the number of the [e820](http://en.wikipedia.org/wiki/E820) entries which we collected in the [boot time](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-2), sanitize the BIOS e820 map and fill `e820map` structure with the memory regions. As all regions are collected, print of all regions with printk. You can find this print if you execute `dmesg` command and you can see something like this:
``` ```
[ 0.000000] e820: BIOS-provided physical RAM map: [ 0.000000] e820: BIOS-provided physical RAM map:
@ -408,7 +408,7 @@ static inline void __init copy_edd(void)
Memory descriptor initialization Memory descriptor initialization
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
The next step is initialization of the memory descriptor of the init process. As you already can know every process has its own address space. This address space presented with special data structure which called `memory descriptor`. Directly in the linux kernel source code memory descriptor presented with `mm_struct` structure. `mm_struct` contains many different fields related with the process address space as start/end address of the kernel code/data, start/end of the brk, number of memory areas, list of memory areas and etc... This structure defined in the [include/linux/mm_types.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/mm_types.h). As every process has its own memory descriptor, `task_struct` structure contains it in the `mm` and `active_mm` field. And our first `init` process has it too. You can remember that we saw the part of initialization of the init `task_struct` with `INIT_TASK` macro in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html): The next step is initialization of the memory descriptor of the init process. As you already can know every process has its own address space. This address space presented with special data structure which called `memory descriptor`. Directly in the linux kernel source code memory descriptor presented with `mm_struct` structure. `mm_struct` contains many different fields related with the process address space as start/end address of the kernel code/data, start/end of the brk, number of memory areas, list of memory areas and etc... This structure defined in the [include/linux/mm_types.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/mm_types.h). As every process has its own memory descriptor, `task_struct` structure contains it in the `mm` and `active_mm` field. And our first `init` process has it too. You can remember that we saw the part of initialization of the init `task_struct` with `INIT_TASK` macro in the previous [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-4):
```C ```C
#define INIT_TASK(tsk) \ #define INIT_TASK(tsk) \
@ -511,4 +511,4 @@ Links
* [CFI directives](https://sourceware.org/binutils/docs/as/CFI-directives.html) * [CFI directives](https://sourceware.org/binutils/docs/as/CFI-directives.html)
* [PDF. dwarf4 specification](http://dwarfstd.org/doc/DWARF4.pdf) * [PDF. dwarf4 specification](http://dwarfstd.org/doc/DWARF4.pdf)
* [Call stack](http://en.wikipedia.org/wiki/Call_stack) * [Call stack](http://en.wikipedia.org/wiki/Call_stack)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-4)

View File

@ -4,7 +4,7 @@ Kernel initialization. Part 6.
Architecture-specific initialization, again... Architecture-specific initialization, again...
================================================================================ ================================================================================
In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) we saw architecture-specific (`x86_64` in our case) initialization stuff from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c) and finished on `x86_configure_nx` function which sets the `_PAGE_NX` flag depends on support of [NX bit](http://en.wikipedia.org/wiki/NX_bit). As I wrote before `setup_arch` function and `start_kernel` are very big, so in this and in the next part we will continue to learn about architecture-specific initialization process. The next function after `x86_configure_nx` is `parse_early_param`. This function is defined in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) and as you can understand from its name, this function parses kernel command line and setups different services depends on the given parameters (all kernel command line parameters you can find are in the [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/kernel-parameters.rst)). You may remember how we setup `earlyprintk` in the earliest [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html). On the early stage we looked for kernel parameters and their value with the `cmdline_find_option` function and `__cmdline_find_option`, `__cmdline_find_option_bool` helpers from the [arch/x86/boot/cmdline.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/cmdline.c). There we're in the generic kernel part which does not depend on architecture and here we use another approach. If you are reading linux kernel source code, you already note calls like this: In the previous [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-5) we saw architecture-specific (`x86_64` in our case) initialization stuff from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c) and finished on `x86_configure_nx` function which sets the `_PAGE_NX` flag depends on support of [NX bit](http://en.wikipedia.org/wiki/NX_bit). As I wrote before `setup_arch` function and `start_kernel` are very big, so in this and in the next part we will continue to learn about architecture-specific initialization process. The next function after `x86_configure_nx` is `parse_early_param`. This function is defined in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) and as you can understand from its name, this function parses kernel command line and setups different services depends on the given parameters (all kernel command line parameters you can find are in the [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/kernel-parameters.rst)). You may remember how we setup `earlyprintk` in the earliest [part](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-2). On the early stage we looked for kernel parameters and their value with the `cmdline_find_option` function and `__cmdline_find_option`, `__cmdline_find_option_bool` helpers from the [arch/x86/boot/cmdline.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/cmdline.c). There we're in the generic kernel part which does not depend on architecture and here we use another approach. If you are reading linux kernel source code, you already note calls like this:
```C ```C
early_param("gbpages", parse_direct_gbpages_on); early_param("gbpages", parse_direct_gbpages_on);
@ -97,7 +97,7 @@ After this we can see call of the:
memblock_x86_reserve_range_setup_data(); memblock_x86_reserve_range_setup_data();
``` ```
function. This function is defined in the same [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c) source code file and remaps memory for the `setup_data` and reserved memory block for the `setup_data` (more about `setup_data` you can read in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) and about `ioremap` and `memblock` you can read in the [Linux kernel memory management](https://0xax.gitbooks.io/linux-insides/content/MM/index.html)). function. This function is defined in the same [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c) source code file and remaps memory for the `setup_data` and reserved memory block for the `setup_data` (more about `setup_data` you can read in the previous [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-5) and about `ioremap` and `memblock` you can read in the [Linux kernel memory management](https://0xax.gitbook.io/linux-insides/summary/mm)).
In the next step we can see following conditional statement: In the next step we can see following conditional statement:
@ -128,7 +128,7 @@ int __init acpi_mps_check(void)
} }
``` ```
It checks the built-in `MPS` or [MultiProcessor Specification](http://en.wikipedia.org/wiki/MultiProcessor_Specification) table. If `CONFIG_X86_LOCAL_APIC` is set and `CONFIG_x86_MPPAARSE` is not set, `acpi_mps_check` prints warning message if the one of the command line options: `acpi=off`, `acpi=noirq` or `pci=noacpi` passed to the kernel. If `acpi_mps_check` returns `1` it means that we disable local [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) and clear `X86_FEATURE_APIC` bit in the of the current CPU with the `setup_clear_cpu_cap` macro. (more about CPU mask you can read in the [CPU masks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html)). It checks the built-in `MPS` or [MultiProcessor Specification](http://en.wikipedia.org/wiki/MultiProcessor_Specification) table. If `CONFIG_X86_LOCAL_APIC` is set and `CONFIG_x86_MPPAARSE` is not set, `acpi_mps_check` prints warning message if the one of the command line options: `acpi=off`, `acpi=noirq` or `pci=noacpi` passed to the kernel. If `acpi_mps_check` returns `1` it means that we disable local [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) and clear `X86_FEATURE_APIC` bit in the of the current CPU with the `setup_clear_cpu_cap` macro. (more about CPU mask you can read in the [CPU masks](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2)).
Early PCI dump Early PCI dump
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
@ -193,7 +193,7 @@ That's all. We will not go deep in the `pci` details, but will see more details
Finish with memory parsing Finish with memory parsing
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
After the `early_dump_pci_devices`, there are a couple of function related with available memory and [e820](http://en.wikipedia.org/wiki/E820) which we collected in the [First steps in the kernel setup](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) part: After the `early_dump_pci_devices`, there are a couple of function related with available memory and [e820](http://en.wikipedia.org/wiki/E820) which we collected in the [First steps in the kernel setup](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-2) part:
```C ```C
/* update the e820_saved too */ /* update the e820_saved too */
@ -535,8 +535,8 @@ Links
* [NX bit](http://en.wikipedia.org/wiki/NX_bit) * [NX bit](http://en.wikipedia.org/wiki/NX_bit)
* [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/kernel-parameters.rst) * [Documentation/kernel-parameters.txt](https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/kernel-parameters.rst)
* [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) * [APIC](http://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
* [CPU masks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) * [CPU masks](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2)
* [Linux kernel memory management](https://0xax.gitbooks.io/linux-insides/content/MM/index.html) * [Linux kernel memory management](https://0xax.gitbook.io/linux-insides/summary/mm)
* [PCI](http://en.wikipedia.org/wiki/Conventional_PCI) * [PCI](http://en.wikipedia.org/wiki/Conventional_PCI)
* [e820](http://en.wikipedia.org/wiki/E820) * [e820](http://en.wikipedia.org/wiki/E820)
* [System Management BIOS](http://en.wikipedia.org/wiki/System_Management_BIOS) * [System Management BIOS](http://en.wikipedia.org/wiki/System_Management_BIOS)
@ -546,4 +546,4 @@ Links
* [MultiProcessor Specification](http://www.intel.com/design/pentium/datashts/24201606.pdf) * [MultiProcessor Specification](http://www.intel.com/design/pentium/datashts/24201606.pdf)
* [BSS](http://en.wikipedia.org/wiki/.bss) * [BSS](http://en.wikipedia.org/wiki/.bss)
* [SMBIOS specification](http://www.dmtf.org/sites/default/files/standards/documents/DSP0134v2.5Final.pdf) * [SMBIOS specification](http://www.dmtf.org/sites/default/files/standards/documents/DSP0134v2.5Final.pdf)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-5)

View File

@ -4,7 +4,7 @@ Kernel initialization. Part 7.
The End of the architecture-specific initialization, almost... The End of the architecture-specific initialization, almost...
================================================================================ ================================================================================
This is the seventh part of the Linux Kernel initialization process which covers insides of the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c#L861). As you can know from the previous [parts](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html), the `setup_arch` function does some architecture-specific (in our case it is [x86_64](http://en.wikipedia.org/wiki/X86-64)) initialization stuff like reserving memory for kernel code/data/bss, early scanning of the [Desktop Management Interface](http://en.wikipedia.org/wiki/Desktop_Management_Interface), early dump of the [PCI](http://en.wikipedia.org/wiki/PCI) device and many many more. If you have read the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html), you can remember that we've finished it at the `setup_real_mode` function. In the next step, as we set limit of the [memblock](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-1.html) to the all mapped pages, we can see the call of the `setup_log_buf` function from the [kernel/printk/printk.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/printk/printk.c). This is the seventh part of the Linux Kernel initialization process which covers insides of the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c#L861). As you can know from the previous [parts](https://0xax.gitbook.io/linux-insides/summary/initialization), the `setup_arch` function does some architecture-specific (in our case it is [x86_64](http://en.wikipedia.org/wiki/X86-64)) initialization stuff like reserving memory for kernel code/data/bss, early scanning of the [Desktop Management Interface](http://en.wikipedia.org/wiki/Desktop_Management_Interface), early dump of the [PCI](http://en.wikipedia.org/wiki/PCI) device and many many more. If you have read the previous [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-6), you can remember that we've finished it at the `setup_real_mode` function. In the next step, as we set limit of the [memblock](https://0xax.gitbook.io/linux-insides/summary/mm/linux-mm-1) to the all mapped pages, we can see the call of the `setup_log_buf` function from the [kernel/printk/printk.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/printk/printk.c).
The `setup_log_buf` function setups kernel cyclic buffer and its length depends on the `CONFIG_LOG_BUF_SHIFT` configuration option. As we can read from the documentation of the `CONFIG_LOG_BUF_SHIFT` it can be between `12` and `21`. In the insides, buffer defined as array of chars: The `setup_log_buf` function setups kernel cyclic buffer and its length depends on the `CONFIG_LOG_BUF_SHIFT` configuration option. As we can read from the documentation of the `CONFIG_LOG_BUF_SHIFT` it can be between `12` and `21`. In the insides, buffer defined as array of chars:
@ -32,7 +32,7 @@ setup_log_buf(1);
where `1` means that it is early setup. In the next step we check `new_log_buf_len` variable which is updated length of the kernel log buffer and allocate new space for the buffer with the `memblock_virt_alloc` function for it, or just return. where `1` means that it is early setup. In the next step we check `new_log_buf_len` variable which is updated length of the kernel log buffer and allocate new space for the buffer with the `memblock_virt_alloc` function for it, or just return.
As kernel log buffer is ready, the next function is `reserve_initrd`. You can remember that we already called the `early_reserve_initrd` function in the fourth part of the [Kernel initialization](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html). Now, as we reconstructed direct memory mapping in the `init_mem_mapping` function, we need to move [initrd](http://en.wikipedia.org/wiki/Initrd) into directly mapped memory. The `reserve_initrd` function starts from the definition of the base address and end address of the `initrd` and check that `initrd` is provided by a bootloader. All the same as what we saw in the `early_reserve_initrd`. But instead of the reserving place in the `memblock` area with the call of the `memblock_reserve` function, we get the mapped size of the direct memory area and check that the size of the `initrd` is not greater than this area with: As kernel log buffer is ready, the next function is `reserve_initrd`. You can remember that we already called the `early_reserve_initrd` function in the fourth part of the [Kernel initialization](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-4). Now, as we reconstructed direct memory mapping in the `init_mem_mapping` function, we need to move [initrd](http://en.wikipedia.org/wiki/Initrd) into directly mapped memory. The `reserve_initrd` function starts from the definition of the base address and end address of the `initrd` and check that `initrd` is provided by a bootloader. All the same as what we saw in the `early_reserve_initrd`. But instead of the reserving place in the `memblock` area with the call of the `memblock_reserve` function, we get the mapped size of the direct memory area and check that the size of the `initrd` is not greater than this area with:
```C ```C
mapped_size = memblock_mem_size(max_pfn_mapped); mapped_size = memblock_mem_size(max_pfn_mapped);
@ -68,7 +68,7 @@ memblock_free(ramdisk_image, ramdisk_end - ramdisk_image);
After we relocated `initrd` ramdisk image, the next function is `vsmp_init` from the [arch/x86/kernel/vsmp_64.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/vsmp_64.c). This function initializes support of the `ScaleMP vSMP`. As I already wrote in the previous parts, this chapter will not cover non-related `x86_64` initialization parts (for example as the current or `ACPI`, etc.). So we will skip implementation of this for now and will back to it in the part which cover techniques of parallel computing. After we relocated `initrd` ramdisk image, the next function is `vsmp_init` from the [arch/x86/kernel/vsmp_64.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/vsmp_64.c). This function initializes support of the `ScaleMP vSMP`. As I already wrote in the previous parts, this chapter will not cover non-related `x86_64` initialization parts (for example as the current or `ACPI`, etc.). So we will skip implementation of this for now and will back to it in the part which cover techniques of parallel computing.
The next function is `io_delay_init` from the [arch/x86/kernel/io_delay.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/io_delay.c). This function allows to override default I/O delay `0x80` port. We already saw I/O delay in the [Last preparation before transition into protected mode](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html), now let's look on the `io_delay_init` implementation: The next function is `io_delay_init` from the [arch/x86/kernel/io_delay.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/io_delay.c). This function allows to override default I/O delay `0x80` port. We already saw I/O delay in the [Last preparation before transition into protected mode](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-3), now let's look on the `io_delay_init` implementation:
```C ```C
void __init io_delay_init(void) void __init io_delay_init(void)
@ -98,7 +98,7 @@ We can see `io_delay` command line parameter setup with the `early_param` macro
early_param("io_delay", io_delay_param); early_param("io_delay", io_delay_param);
``` ```
More about `early_param` you can read in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html). So the `io_delay_param` function which setups `io_delay_override` variable will be called in the [do_early_param](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c#L413) function. `io_delay_param` function gets the argument of the `io_delay` kernel command line parameter and sets `io_delay_type` depends on it: More about `early_param` you can read in the previous [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-6). So the `io_delay_param` function which setups `io_delay_override` variable will be called in the [do_early_param](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c#L413) function. `io_delay_param` function gets the argument of the `io_delay` kernel command line parameter and sets `io_delay_type` depends on it:
```C ```C
static int __init io_delay_param(char *s) static int __init io_delay_param(char *s)
@ -296,19 +296,19 @@ BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
(unsigned long)VSYSCALL_ADDR); (unsigned long)VSYSCALL_ADDR);
``` ```
Now `vsyscall` area is in the `fix-mapped` area. That's all about `map_vsyscall`, if you do not know anything about fix-mapped addresses, you can read [Fix-Mapped Addresses and ioremap](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-2.html). We will see more about `vsyscalls` in the `vsyscalls and vdso` part. Now `vsyscall` area is in the `fix-mapped` area. That's all about `map_vsyscall`, if you do not know anything about fix-mapped addresses, you can read [Fix-Mapped Addresses and ioremap](https://0xax.gitbook.io/linux-insides/summary/mm/linux-mm-2). We will see more about `vsyscalls` in the `vsyscalls and vdso` part.
Getting the SMP configuration Getting the SMP configuration
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
You may remember how we made a search of the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) configuration in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html). Now we need to get the `SMP` configuration if we found it. For this we check `smp_found_config` variable which we set in the `smp_scan_config` function (read about it the previous part) and call the `get_smp_config` function: You may remember how we made a search of the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) configuration in the previous [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-6). Now we need to get the `SMP` configuration if we found it. For this we check `smp_found_config` variable which we set in the `smp_scan_config` function (read about it the previous part) and call the `get_smp_config` function:
```C ```C
if (smp_found_config) if (smp_found_config)
get_smp_config(); get_smp_config();
``` ```
The `get_smp_config` expands to the `x86_init.mpparse.default_get_smp_config` function which is defined in the [arch/x86/kernel/mpparse.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/mpparse.c). This function defines a pointer to the multiprocessor floating pointer structure - `mpf_intel` (you can read about it in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html)) and does some checks: The `get_smp_config` expands to the `x86_init.mpparse.default_get_smp_config` function which is defined in the [arch/x86/kernel/mpparse.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/mpparse.c). This function defines a pointer to the multiprocessor floating pointer structure - `mpf_intel` (you can read about it in the previous [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-6)) and does some checks:
```C ```C
struct mpf_intel *mpf = mpf_found; struct mpf_intel *mpf = mpf_found;
@ -320,7 +320,7 @@ if (acpi_lapic && early)
return; return;
``` ```
Here we can see that multiprocessor configuration was found in the `smp_scan_config` function or just return from the function if not. The next check is `acpi_lapic` and `early`. And as we did this checks, we start to read the `SMP` configuration. As we finished reading it, the next step is - `prefill_possible_map` function which makes preliminary filling of the possible CPU's `cpumask` (more about it you can read in the [Introduction to the cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html)). Here we can see that multiprocessor configuration was found in the `smp_scan_config` function or just return from the function if not. The next check is `acpi_lapic` and `early`. And as we did this checks, we start to read the `SMP` configuration. As we finished reading it, the next step is - `prefill_possible_map` function which makes preliminary filling of the possible CPU's `cpumask` (more about it you can read in the [Introduction to the cpumasks](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2)).
The rest of the setup_arch The rest of the setup_arch
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
@ -334,7 +334,7 @@ That's all, and now we can back to the `start_kernel` from the `setup_arch`.
Back to the main.c Back to the main.c
================================================================================ ================================================================================
As I wrote above, we have finished with the `setup_arch` function and now we can back to the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c). As you may remember or saw yourself, `start_kernel` function as big as the `setup_arch`. So the couple of the next part will be dedicated to learning of this function. So, let's continue with it. After the `setup_arch` we can see the call of the `mm_init_cpumask` function. This function sets the [cpumask](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) pointer to the memory descriptor `cpumask`. We can look on its implementation: As I wrote above, we have finished with the `setup_arch` function and now we can back to the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c). As you may remember or saw yourself, `start_kernel` function as big as the `setup_arch`. So the couple of the next part will be dedicated to learning of this function. So, let's continue with it. After the `setup_arch` we can see the call of the `mm_init_cpumask` function. This function sets the [cpumask](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2) pointer to the memory descriptor `cpumask`. We can look on its implementation:
```C ```C
static inline void mm_init_cpumask(struct mm_struct *mm) static inline void mm_init_cpumask(struct mm_struct *mm)
@ -379,7 +379,7 @@ static void __init setup_command_line(char *command_line)
Here we can see that we allocate space for the three buffers which will contain kernel command line for the different purposes (read above). And as we allocated space, we store `boot_command_line` in the `saved_command_line` and `command_line` (kernel command line from the `setup_arch`) to the `static_command_line`. Here we can see that we allocate space for the three buffers which will contain kernel command line for the different purposes (read above). And as we allocated space, we store `boot_command_line` in the `saved_command_line` and `command_line` (kernel command line from the `setup_arch`) to the `static_command_line`.
The next function after the `setup_command_line` is the `setup_nr_cpu_ids`. This function setting `nr_cpu_ids` (number of CPUs) according to the last bit in the `cpu_possible_mask` (more about it you can read in the chapter describes [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) concept). Let's look on its implementation: The next function after the `setup_command_line` is the `setup_nr_cpu_ids`. This function setting `nr_cpu_ids` (number of CPUs) according to the last bit in the `cpu_possible_mask` (more about it you can read in the chapter describes [cpumasks](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2) concept). Let's look on its implementation:
```C ```C
void __init setup_nr_cpu_ids(void) void __init setup_nr_cpu_ids(void)
@ -479,4 +479,4 @@ Links
* [vsyscalls](https://lwn.net/Articles/446528/) * [vsyscalls](https://lwn.net/Articles/446528/)
* [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) * [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing)
* [jiffy](http://en.wikipedia.org/wiki/Jiffy_%28time%29) * [jiffy](http://en.wikipedia.org/wiki/Jiffy_%28time%29)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-6)

View File

@ -4,9 +4,9 @@ Kernel initialization. Part 8.
Scheduler initialization Scheduler initialization
================================================================================ ================================================================================
This is the eighth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) of the Linux kernel initialization process chapter and we stopped on the `setup_nr_cpu_ids` function in the [previous part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-7.md). This is the eighth [part](https://0xax.gitbook.io/linux-insides/summary/initialization) of the Linux kernel initialization process chapter and we stopped on the `setup_nr_cpu_ids` function in the [previous part](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-7.md).
The main point of this part is [scheduler](http://en.wikipedia.org/wiki/Scheduling_%28computing%29) initialization. But before we will start to learn initialization process of the scheduler, we need to do some stuff. The next step in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) is the `setup_per_cpu_areas` function. This function setups memory areas for the `percpu` variables, more about it you can read in the special part about the [Per-CPU variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html). After `percpu` areas is up and running, the next step is the `smp_prepare_boot_cpu` function. The main point of this part is [scheduler](http://en.wikipedia.org/wiki/Scheduling_%28computing%29) initialization. But before we will start to learn initialization process of the scheduler, we need to do some stuff. The next step in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) is the `setup_per_cpu_areas` function. This function setups memory areas for the `percpu` variables, more about it you can read in the special part about the [Per-CPU variables](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1). After `percpu` areas is up and running, the next step is the `smp_prepare_boot_cpu` function.
This function does some preparations for [symmetric multiprocessing](http://en.wikipedia.org/wiki/Symmetric_multiprocessing). Since this function is architecture specific, it is located in the [arch/x86/include/asm/smp.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/smp.h#L78) Linux kernel header file. Let's look at the definition of this function: This function does some preparations for [symmetric multiprocessing](http://en.wikipedia.org/wiki/Symmetric_multiprocessing). Since this function is architecture specific, it is located in the [arch/x86/include/asm/smp.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/smp.h#L78) Linux kernel header file. Let's look at the definition of this function:
@ -44,7 +44,7 @@ void __init native_smp_prepare_boot_cpu(void)
} }
``` ```
and executes following things: first of all it gets the `id` of the current CPU (which is Bootstrap processor and its `id` is zero for this moment) with the `smp_processor_id` function. I will not explain how the `smp_processor_id` works, because we already saw it in the [Kernel entry point](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) part. After we've got processor `id` number we reload [Global Descriptor Table](http://en.wikipedia.org/wiki/Global_Descriptor_Table) for the given CPU with the `switch_to_new_gdt` function: and executes following things: first of all it gets the `id` of the current CPU (which is Bootstrap processor and its `id` is zero for this moment) with the `smp_processor_id` function. I will not explain how the `smp_processor_id` works, because we already saw it in the [Kernel entry point](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-4) part. After we've got processor `id` number we reload [Global Descriptor Table](http://en.wikipedia.org/wiki/Global_Descriptor_Table) for the given CPU with the `switch_to_new_gdt` function:
```C ```C
void switch_to_new_gdt(int cpu) void switch_to_new_gdt(int cpu)
@ -58,7 +58,7 @@ void switch_to_new_gdt(int cpu)
} }
``` ```
The `gdt_descr` variable represents pointer to the `GDT` descriptor here (we already saw definition of a `desc_ptr` structure in the [Early interrupt and exception handling](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) part). We get the address and the size of the `GDT` descriptor for the `CPU` with the given `id`. The `GDT_SIZE` is `256` or: The `gdt_descr` variable represents pointer to the `GDT` descriptor here (we already saw definition of a `desc_ptr` structure in the [Early interrupt and exception handling](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-2) part). We get the address and the size of the `GDT` descriptor for the `CPU` with the given `id`. The `GDT_SIZE` is `256` or:
```C ```C
#define GDT_SIZE (GDT_ENTRIES * 8) #define GDT_SIZE (GDT_ENTRIES * 8)
@ -75,7 +75,7 @@ static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu)
The `get_cpu_gdt_table` uses `per_cpu` macro for getting value of a `gdt_page` percpu variable for the given CPU number (bootstrap processor with `id` - 0 in our case). The `get_cpu_gdt_table` uses `per_cpu` macro for getting value of a `gdt_page` percpu variable for the given CPU number (bootstrap processor with `id` - 0 in our case).
You may ask the following question: so, if we can access `gdt_page` percpu variable, where it was defined? Actually we already saw it in this book. If you have read the first [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of this chapter, you can remember that we saw definition of the `gdt_page` in the [arch/x86/kernel/head_64.S](https://github.com/0xAX/linux/blob/0a07b238e5f488b459b6113a62e06b6aab017f71/arch/x86/kernel/head_64.S): You may ask the following question: so, if we can access `gdt_page` percpu variable, where it was defined? Actually we already saw it in this book. If you have read the first [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-1) of this chapter, you can remember that we saw definition of the `gdt_page` in the [arch/x86/kernel/head_64.S](https://github.com/0xAX/linux/blob/0a07b238e5f488b459b6113a62e06b6aab017f71/arch/x86/kernel/head_64.S):
```assembly ```assembly
early_gdt_descr: early_gdt_descr:
@ -107,7 +107,7 @@ DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
... ...
``` ```
more about `percpu` variables you can read in the [Per-CPU variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) part. As we got address and size of the `GDT` descriptor we reload `GDT` with the `load_gdt` which just execute `lgdt` instruct and load `percpu_segment` with the following function: more about `percpu` variables you can read in the [Per-CPU variables](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1) part. As we got address and size of the `GDT` descriptor we reload `GDT` with the `load_gdt` which just execute `lgdt` instruct and load `percpu_segment` with the following function:
```C ```C
void load_percpu_segment(int cpu) { void load_percpu_segment(int cpu) {
@ -205,11 +205,11 @@ After this function we can see the kernel command line in the initialization out
![kernel command line](images/kernel_command_line.png) ![kernel command line](images/kernel_command_line.png)
And a couple of functions such as `parse_early_param` and `parse_args` which handles linux kernel command line. You may remember that we already saw the call of the `parse_early_param` function in the sixth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the kernel initialization chapter, so why we call it again? Answer is simple: we call this function in the architecture-specific code (`x86_64` in our case), but not all architecture calls this function. And we need to call the second function `parse_args` to parse and handle non-early command line arguments. And a couple of functions such as `parse_early_param` and `parse_args` which handles linux kernel command line. You may remember that we already saw the call of the `parse_early_param` function in the sixth [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-6) of the kernel initialization chapter, so why we call it again? Answer is simple: we call this function in the architecture-specific code (`x86_64` in our case), but not all architecture calls this function. And we need to call the second function `parse_args` to parse and handle non-early command line arguments.
In the next step we can see the call of the `jump_label_init` from the [kernel/jump_label.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/jump_label.c). and initializes [jump label](https://lwn.net/Articles/412072/). In the next step we can see the call of the `jump_label_init` from the [kernel/jump_label.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/jump_label.c). and initializes [jump label](https://lwn.net/Articles/412072/).
After this we can see the call of the `setup_log_buf` function which setups the [printk](http://www.makelinux.net/books/lkd2/ch18lev1sec3) log buffer. We already saw this function in the seventh [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html) of the linux kernel initialization process chapter. After this we can see the call of the `setup_log_buf` function which setups the [printk](http://www.makelinux.net/books/lkd2/ch18lev1sec3) log buffer. We already saw this function in the seventh [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-7) of the linux kernel initialization process chapter.
PID hash initialization PID hash initialization
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
@ -230,7 +230,7 @@ pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18,
``` ```
The number of elements of the `pid_hash` depends on the `RAM` configuration, but it can be between `2^4` and `2^12`. The `pidhash_init` computes the size The number of elements of the `pid_hash` depends on the `RAM` configuration, but it can be between `2^4` and `2^12`. The `pidhash_init` computes the size
and allocates the required storage (which is `hlist` in our case - the same as [doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-1.html), but contains one pointer instead on the [struct hlist_head](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/types.h)]. The `alloc_large_system_hash` function allocates a large system hash table with `memblock_virt_alloc_nopanic` if we pass `HASH_EARLY` flag (as it in our case) or with `__vmalloc` if we did no pass this flag. and allocates the required storage (which is `hlist` in our case - the same as [doubly linked list](https://0xax.gitbook.io/linux-insides/summary/datastructures/linux-datastructures-1), but contains one pointer instead on the [struct hlist_head](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/types.h)]. The `alloc_large_system_hash` function allocates a large system hash table with `memblock_virt_alloc_nopanic` if we pass `HASH_EARLY` flag (as it in our case) or with `__vmalloc` if we did no pass this flag.
The result we can see in the `dmesg` output: The result we can see in the `dmesg` output:
@ -255,7 +255,7 @@ pgtable_init();
vmalloc_init(); vmalloc_init();
``` ```
The first is `page_ext_init_flatmem` which depends on the `CONFIG_SPARSEMEM` kernel configuration option and initializes extended data per page handling. The `mem_init` releases all `bootmem`, the `kmem_cache_init` initializes kernel cache, the `percpu_init_late` - replaces `percpu` chunks with those allocated by [slub](http://en.wikipedia.org/wiki/SLUB_%28software%29), the `pgtable_init` - initializes the `page->ptl` kernel cache, the `vmalloc_init` - initializes `vmalloc`. Please, **NOTE** that we will not dive into details about all of these functions and concepts, but we will see all of they it in the [Linux kernel memory manager](https://0xax.gitbooks.io/linux-insides/content/MM/index.html) chapter. The first is `page_ext_init_flatmem` which depends on the `CONFIG_SPARSEMEM` kernel configuration option and initializes extended data per page handling. The `mem_init` releases all `bootmem`, the `kmem_cache_init` initializes kernel cache, the `percpu_init_late` - replaces `percpu` chunks with those allocated by [slub](http://en.wikipedia.org/wiki/SLUB_%28software%29), the `pgtable_init` - initializes the `page->ptl` kernel cache, the `vmalloc_init` - initializes `vmalloc`. Please, **NOTE** that we will not dive into details about all of these functions and concepts, but we will see all of they it in the [Linux kernel memory manager](https://0xax.gitbook.io/linux-insides/summary/mm) chapter.
That's all. Now we can look on the `scheduler`. That's all. Now we can look on the `scheduler`.
@ -533,7 +533,7 @@ __init void init_sched_fair_class(void)
} }
``` ```
Here we register a [soft irq](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-9.html) that will call the `run_rebalance_domains` handler. After the `SCHED_SOFTIRQ` will be triggered, the `run_rebalance` will be called to rebalance a run queue on the current CPU. Here we register a [soft irq](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-9) that will call the `run_rebalance_domains` handler. After the `SCHED_SOFTIRQ` will be triggered, the `run_rebalance` will be called to rebalance a run queue on the current CPU.
The last two steps of the `sched_init` function is to initialization of scheduler statistics and setting `scheeduler_running` variable: The last two steps of the `sched_init` function is to initialization of scheduler statistics and setting `scheeduler_running` variable:
@ -555,19 +555,19 @@ If you have any questions or suggestions write me a comment or ping me at [twitt
Links Links
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
* [CPU masks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) * [CPU masks](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2)
* [high-resolution kernel timer](https://www.kernel.org/doc/Documentation/timers/hrtimers.txt) * [high-resolution kernel timer](https://www.kernel.org/doc/Documentation/timers/hrtimers.txt)
* [spinlock](http://en.wikipedia.org/wiki/Spinlock) * [spinlock](http://en.wikipedia.org/wiki/Spinlock)
* [Run queue](http://en.wikipedia.org/wiki/Run_queue) * [Run queue](http://en.wikipedia.org/wiki/Run_queue)
* [Linux kernel memory manager](https://0xax.gitbooks.io/linux-insides/content/MM/index.html) * [Linux kernel memory manager](https://0xax.gitbook.io/linux-insides/summary/mm)
* [slub](http://en.wikipedia.org/wiki/SLUB_%28software%29) * [slub](http://en.wikipedia.org/wiki/SLUB_%28software%29)
* [virtual file system](http://en.wikipedia.org/wiki/Virtual_file_system) * [virtual file system](http://en.wikipedia.org/wiki/Virtual_file_system)
* [Linux kernel hotplug documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) * [Linux kernel hotplug documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
* [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) * [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [Global Descriptor Table](http://en.wikipedia.org/wiki/Global_Descriptor_Table) * [Global Descriptor Table](http://en.wikipedia.org/wiki/Global_Descriptor_Table)
* [Per-CPU variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) * [Per-CPU variables](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1)
* [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) * [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing)
* [RCU](http://en.wikipedia.org/wiki/Read-copy-update) * [RCU](http://en.wikipedia.org/wiki/Read-copy-update)
* [CFS Scheduler documentation](https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt) * [CFS Scheduler documentation](https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt)
* [Real-Time group scheduling](https://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt) * [Real-Time group scheduling](https://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-7)

View File

@ -4,7 +4,7 @@ Kernel initialization. Part 9.
RCU initialization RCU initialization
================================================================================ ================================================================================
This is ninth part of the [Linux Kernel initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and in the previous part we stopped at the [scheduler initialization](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html). In this part we will continue to dive to the linux kernel initialization process and the main purpose of this part will be to learn about initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). We can see that the next step in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) after the `sched_init` is the call of the `preempt_disable`. There are two macros: This is ninth part of the [Linux Kernel initialization process](https://0xax.gitbook.io/linux-insides/summary/initialization) and in the previous part we stopped at the [scheduler initialization](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-8). In this part we will continue to dive to the linux kernel initialization process and the main purpose of this part will be to learn about initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). We can see that the next step in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) after the `sched_init` is the call of the `preempt_disable`. There are two macros:
* `preempt_disable` * `preempt_disable`
* `preempt_enable` * `preempt_enable`
@ -38,7 +38,7 @@ In the first implementation of the `preempt_disable` we increment this `__preemp
#define preempt_count_add(val) __preempt_count_add(val) #define preempt_count_add(val) __preempt_count_add(val)
``` ```
where `preempt_count_add` calls the `raw_cpu_add_4` macro which adds `1` to the given `percpu` variable (`__preempt_count`) in our case (more about `precpu` variables you can read in the part about [Per-CPU variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html)). Ok, we increased `__preempt_count` and the next step we can see the call of the `barrier` macro in the both macros. The `barrier` macro inserts an optimization barrier. In the processors with `x86_64` architecture independent memory access operations can be performed in any order. That's why we need the opportunity to point compiler and processor on compliance of order. This mechanism is memory barrier. Let's consider a simple example: where `preempt_count_add` calls the `raw_cpu_add_4` macro which adds `1` to the given `percpu` variable (`__preempt_count`) in our case (more about `precpu` variables you can read in the part about [Per-CPU variables](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1)). Ok, we increased `__preempt_count` and the next step we can see the call of the `barrier` macro in the both macros. The `barrier` macro inserts an optimization barrier. In the processors with `x86_64` architecture independent memory access operations can be performed in any order. That's why we need the opportunity to point compiler and processor on compliance of order. This mechanism is memory barrier. Let's consider a simple example:
```C ```C
preempt_disable(); preempt_disable();
@ -83,7 +83,7 @@ void __init idr_init_cache(void)
} }
``` ```
Here we can see the call of the `kmem_cache_create`. We already called the `kmem_cache_init` in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c#L485). This function create generalized caches again using the `kmem_cache_alloc` (more about caches we will see in the [Linux kernel memory management](https://0xax.gitbooks.io/linux-insides/content/MM/index.html) chapter). In our case, as we are using `kmem_cache_t` which will be used by the [slab](http://en.wikipedia.org/wiki/Slab_allocation) allocator and `kmem_cache_create` creates it. As you can see we pass five parameters to the `kmem_cache_create`: Here we can see the call of the `kmem_cache_create`. We already called the `kmem_cache_init` in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c#L485). This function create generalized caches again using the `kmem_cache_alloc` (more about caches we will see in the [Linux kernel memory management](https://0xax.gitbook.io/linux-insides/summary/mm) chapter). In our case, as we are using `kmem_cache_t` which will be used by the [slab](http://en.wikipedia.org/wiki/Slab_allocation) allocator and `kmem_cache_create` creates it. As you can see we pass five parameters to the `kmem_cache_create`:
* name of the cache; * name of the cache;
* size of the object to store in cache; * size of the object to store in cache;
@ -127,7 +127,7 @@ The next step is [RCU](http://en.wikipedia.org/wiki/Read-copy-update) initializa
In the first case `rcu_init` will be in the [kernel/rcu/tiny.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/rcu/tiny.c) and in the second case it will be defined in the [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/rcu/tree.c). We will see the implementation of the `tree rcu`, but first of all about the `RCU` in general. In the first case `rcu_init` will be in the [kernel/rcu/tiny.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/rcu/tiny.c) and in the second case it will be defined in the [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/rcu/tree.c). We will see the implementation of the `tree rcu`, but first of all about the `RCU` in general.
`RCU` or read-copy update is a scalable high-performance synchronization mechanism implemented in the Linux kernel. On the early stage the linux kernel provided support and environment for the concurrently running applications, but all execution was serialized in the kernel using a single global lock. In our days linux kernel has no single global lock, but provides different mechanisms including [lock-free data structures](http://en.wikipedia.org/wiki/Concurrent_data_structure), [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) data structures and other. One of these mechanisms is - the `read-copy update`. The `RCU` technique is designed for rarely-modified data structures. The idea of the `RCU` is simple. For example we have a rarely-modified data structure. If somebody wants to change this data structure, we make a copy of this data structure and make all changes in the copy. In the same time all other users of the data structure use old version of it. Next, we need to choose safe moment when original version of the data structure will have no users and update it with the modified copy. `RCU` or read-copy update is a scalable high-performance synchronization mechanism implemented in the Linux kernel. On the early stage the linux kernel provided support and environment for the concurrently running applications, but all execution was serialized in the kernel using a single global lock. In our days linux kernel has no single global lock, but provides different mechanisms including [lock-free data structures](http://en.wikipedia.org/wiki/Concurrent_data_structure), [percpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1) data structures and other. One of these mechanisms is - the `read-copy update`. The `RCU` technique is designed for rarely-modified data structures. The idea of the `RCU` is simple. For example we have a rarely-modified data structure. If somebody wants to change this data structure, we make a copy of this data structure and make all changes in the copy. In the same time all other users of the data structure use old version of it. Next, we need to choose safe moment when original version of the data structure will have no users and update it with the modified copy.
Of course this description of the `RCU` is very simplified. To understand some details about `RCU`, first of all we need to learn some terminology. Data readers in the `RCU` executed in the [critical section](http://en.wikipedia.org/wiki/Critical_section). Every time when data reader get to the critical section, it calls the `rcu_read_lock`, and `rcu_read_unlock` on exit from the critical section. If the thread is not in the critical section, it will be in state which called - `quiescent state`. The moment when every thread is in the `quiescent state` called - `grace period`. If a thread wants to remove an element from the data structure, this occurs in two steps. First step is `removal` - atomically removes element from the data structure, but does not release the physical memory. After this thread-writer announces and waits until it is finished. From this moment, the removed element is available to the thread-readers. After the `grace period` finished, the second step of the element removal will be started, it just removes the element from the physical memory. Of course this description of the `RCU` is very simplified. To understand some details about `RCU`, first of all we need to learn some terminology. Data readers in the `RCU` executed in the [critical section](http://en.wikipedia.org/wiki/Critical_section). Every time when data reader get to the critical section, it calls the `rcu_read_lock`, and `rcu_read_unlock` on exit from the critical section. If the thread is not in the critical section, it will be in state which called - `quiescent state`. The moment when every thread is in the `quiescent state` called - `grace period`. If a thread wants to remove an element from the data structure, this occurs in two steps. First step is `removal` - atomically removes element from the data structure, but does not release the physical memory. After this thread-writer announces and waits until it is finished. From this moment, the removed element is available to the thread-readers. After the `grace period` finished, the second step of the element removal will be started, it just removes the element from the physical memory.
@ -378,7 +378,7 @@ Ok, we already passed the main theme of this part which is `RCU` initialization,
After we initialized `RCU`, the next step which you can see in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) is the - `trace_init` function. As you can understand from its name, this function initialize [tracing](http://en.wikipedia.org/wiki/Tracing_%28software%29) subsystem. You can read more about linux kernel trace system - [here](http://elinux.org/Kernel_Trace_Systems). After we initialized `RCU`, the next step which you can see in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) is the - `trace_init` function. As you can understand from its name, this function initialize [tracing](http://en.wikipedia.org/wiki/Tracing_%28software%29) subsystem. You can read more about linux kernel trace system - [here](http://elinux.org/Kernel_Trace_Systems).
After the `trace_init`, we can see the call of the `radix_tree_init`. If you are familiar with the different data structures, you can understand from the name of this function that it initializes kernel implementation of the [Radix tree](http://en.wikipedia.org/wiki/Radix_tree). This function is defined in the [lib/radix-tree.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/lib/radix-tree.c) and you can read more about it in the part about [Radix tree](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-2.html). After the `trace_init`, we can see the call of the `radix_tree_init`. If you are familiar with the different data structures, you can understand from the name of this function that it initializes kernel implementation of the [Radix tree](http://en.wikipedia.org/wiki/Radix_tree). This function is defined in the [lib/radix-tree.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/lib/radix-tree.c) and you can read more about it in the part about [Radix tree](https://0xax.gitbook.io/linux-insides/summary/datastructures/linux-datastructures-2).
In the next step we can see the functions which are related to the `interrupts handling` subsystem, they are: In the next step we can see the functions which are related to the `interrupts handling` subsystem, they are:
@ -394,18 +394,18 @@ The next couple of functions are related with the [perf](https://perf.wiki.kerne
local_irq_enable(); local_irq_enable();
``` ```
which expands to the `sti` instruction and making post initialization of the [SLAB](http://en.wikipedia.org/wiki/Slab_allocation) with the call of the `kmem_cache_init_late` function (As I wrote above we will know about the `SLAB` in the [Linux memory management](https://0xax.gitbooks.io/linux-insides/content/MM/index.html) chapter). which expands to the `sti` instruction and making post initialization of the [SLAB](http://en.wikipedia.org/wiki/Slab_allocation) with the call of the `kmem_cache_init_late` function (As I wrote above we will know about the `SLAB` in the [Linux memory management](https://0xax.gitbook.io/linux-insides/summary/mm) chapter).
After the post initialization of the `SLAB`, next point is initialization of the console with the `console_init` function from the [drivers/tty/tty_io.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/drivers/tty/tty_io.c). After the post initialization of the `SLAB`, next point is initialization of the console with the `console_init` function from the [drivers/tty/tty_io.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/drivers/tty/tty_io.c).
After the console initialization, we can see the `lockdep_info` function which prints information about the [Lock dependency validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt). After this, we can see the initialization of the dynamic allocation of the `debug objects` with the `debug_objects_mem_init`, kernel memory leak [detector](https://www.kernel.org/doc/Documentation/kmemleak.txt) initialization with the `kmemleak_init`, `percpu` pageset setup with the `setup_per_cpu_pageset`, setup of the [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) policy with the `numa_policy_init`, setting time for the scheduler with the `sched_clock_init`, `pidmap` initialization with the call of the `pidmap_init` function for the initial `PID` namespace, cache creation with the `anon_vma_init` for the private virtual memory areas and early initialization of the [ACPI](http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) with the `acpi_early_init`. After the console initialization, we can see the `lockdep_info` function which prints information about the [Lock dependency validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt). After this, we can see the initialization of the dynamic allocation of the `debug objects` with the `debug_objects_mem_init`, kernel memory leak [detector](https://www.kernel.org/doc/Documentation/kmemleak.txt) initialization with the `kmemleak_init`, `percpu` pageset setup with the `setup_per_cpu_pageset`, setup of the [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) policy with the `numa_policy_init`, setting time for the scheduler with the `sched_clock_init`, `pidmap` initialization with the call of the `pidmap_init` function for the initial `PID` namespace, cache creation with the `anon_vma_init` for the private virtual memory areas and early initialization of the [ACPI](http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) with the `acpi_early_init`.
This is the end of the ninth part of the [linux kernel initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and here we saw initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). In the last paragraph of this part (`Rest of the initialization process`) we will go through many functions but did not dive into details about their implementations. Do not worry if you do not know anything about these stuff or you know and do not understand anything about this. As I already wrote many times, we will see details of implementations in other parts or other chapters. This is the end of the ninth part of the [linux kernel initialization process](https://0xax.gitbook.io/linux-insides/summary/initialization) and here we saw initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). In the last paragraph of this part (`Rest of the initialization process`) we will go through many functions but did not dive into details about their implementations. Do not worry if you do not know anything about these stuff or you know and do not understand anything about this. As I already wrote many times, we will see details of implementations in other parts or other chapters.
Conclusion Conclusion
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
It is the end of the ninth part about the linux kernel [initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). In this part, we looked on the initialization process of the `RCU` subsystem. In the next part we will continue to dive into linux kernel initialization process and I hope that we will finish with the `start_kernel` function and will go to the `rest_init` function from the same [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) source code file and will see the start of the first process. It is the end of the ninth part about the linux kernel [initialization process](https://0xax.gitbook.io/linux-insides/summary/initialization). In this part, we looked on the initialization process of the `RCU` subsystem. In the next part we will continue to dive into linux kernel initialization process and I hope that we will finish with the `start_kernel` function and will go to the `rest_init` function from the same [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) source code file and will see the start of the first process.
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
@ -423,8 +423,8 @@ Links
* [integer ID management](https://lwn.net/Articles/103209/) * [integer ID management](https://lwn.net/Articles/103209/)
* [Documentation/memory-barriers.txt](https://www.kernel.org/doc/Documentation/memory-barriers.txt) * [Documentation/memory-barriers.txt](https://www.kernel.org/doc/Documentation/memory-barriers.txt)
* [Runtime locking correctness validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) * [Runtime locking correctness validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
* [Per-CPU variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) * [Per-CPU variables](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1)
* [Linux kernel memory management](https://0xax.gitbooks.io/linux-insides/content/MM/index.html) * [Linux kernel memory management](https://0xax.gitbook.io/linux-insides/summary/mm)
* [slab](http://en.wikipedia.org/wiki/Slab_allocation) * [slab](http://en.wikipedia.org/wiki/Slab_allocation)
* [i2c](http://en.wikipedia.org/wiki/I%C2%B2C) * [i2c](http://en.wikipedia.org/wiki/I%C2%B2C)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-8)

View File

@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 1.
Introduction Introduction
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the first part of the new chapter of the [linux insides](https://0xax.gitbooks.io/linux-insides/content/) book. We have come a long way in the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) of this book. We started from the earliest [steps](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of kernel initialization and finished with the [launch](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-10.html) of the first `init` process. Yes, we saw several initialization steps which are related to the various kernel subsystems. But we did not dig deep into the details of these subsystems. With this chapter, we will try to understand how the various kernel subsystems work and how they are implemented. As you can already understand from the chapter's title, the first subsystem will be [interrupts](http://en.wikipedia.org/wiki/Interrupt). This is the first part of the new chapter of the [linux insides](https://github.com/0xAX/linux-insides/blob/master/SUMMARY.md) book. We have come a long way in the previous [chapter](https://0xax.gitbook.io/linux-insides/summary/initialization) of this book. We started from the earliest [steps](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-1) of kernel initialization and finished with the [launch](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-10) of the first `init` process. Yes, we saw several initialization steps which are related to the various kernel subsystems. But we did not dig deep into the details of these subsystems. With this chapter, we will try to understand how the various kernel subsystems work and how they are implemented. As you can already understand from the chapter's title, the first subsystem will be [interrupts](http://en.wikipedia.org/wiki/Interrupt).
What is an Interrupt? What is an Interrupt?
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
@ -37,7 +37,7 @@ Addresses of each of the interrupt handlers are maintained in a special location
BUG_ON((unsigned)n > 0xFF); BUG_ON((unsigned)n > 0xFF);
``` ```
You can find this check within the Linux kernel source code related to interrupt setup (e.g. The `set_intr_gate` in [arch/x86/kernel/idt.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/ldt.c)). The first 32 vector numbers from `0` to `31` are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. You can find the table with the description of these vector numbers in the second part of the Linux kernel initialization process - [Early interrupt and exception handling](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html). Vector numbers from `32` to `255` are designated as user-defined interrupts and are not reserved by the processor. These interrupts are generally assigned to external I/O devices to enable those devices to send interrupts to the processor. You can find this check within the Linux kernel source code related to interrupt setup (e.g. The `set_intr_gate` in [arch/x86/kernel/idt.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/ldt.c)). The first 32 vector numbers from `0` to `31` are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. You can find the table with the description of these vector numbers in the second part of the Linux kernel initialization process - [Early interrupt and exception handling](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-2). Vector numbers from `32` to `255` are designated as user-defined interrupts and are not reserved by the processor. These interrupts are generally assigned to external I/O devices to enable those devices to send interrupts to the processor.
Now let's talk about the types of interrupts. Broadly speaking, we can split interrupts into 2 major classes: Now let's talk about the types of interrupts. Broadly speaking, we can split interrupts into 2 major classes:
@ -58,7 +58,7 @@ Next a `trap` is an exception, which is reported immediately following the execu
Finally, an `abort` is an exception that does not always report the exact instruction which caused the exception and does not allow the interrupted program to be resumed. Finally, an `abort` is an exception that does not always report the exact instruction which caused the exception and does not allow the interrupted program to be resumed.
Also, we already know from the previous [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) that interrupts can be classified as `maskable` and `non-maskable`. Maskable interrupts are interrupts which can be blocked with the two following instructions for `x86_64` - `sti` and `cli`. We can find them in the Linux kernel source code: Also, we already know from the previous [part](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-3) that interrupts can be classified as `maskable` and `non-maskable`. Maskable interrupts are interrupts which can be blocked with the two following instructions for `x86_64` - `sti` and `cli`. We can find them in the Linux kernel source code:
```C ```C
static inline void native_irq_disable(void) static inline void native_irq_disable(void)
@ -135,13 +135,13 @@ If multiple exceptions or interrupts occur at the same time, the processor handl
+--------------+-------------------------------------------------+ +--------------+-------------------------------------------------+
``` ```
Now that we know a little about the various types of interrupts and exceptions, it is time to move on to a more practical part. We start with the description of the `Interrupt Descriptor Table`. As mentioned earlier, the `IDT` stores entry points of the interrupts and exceptions handlers. The `IDT` is similar in structure to the `Global Descriptor Table` which we saw in the second part of the [Kernel booting process](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html). But of course it has some differences. Instead of `descriptors`, the `IDT` entries are called `gates`. It can contain one of the following gates: Now that we know a little about the various types of interrupts and exceptions, it is time to move on to a more practical part. We start with the description of the `Interrupt Descriptor Table`. As mentioned earlier, the `IDT` stores entry points of the interrupts and exceptions handlers. The `IDT` is similar in structure to the `Global Descriptor Table` which we saw in the second part of the [Kernel booting process](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-2). But of course it has some differences. Instead of `descriptors`, the `IDT` entries are called `gates`. It can contain one of the following gates:
* Interrupt gates * Interrupt gates
* Task gates * Task gates
* Trap gates. * Trap gates.
In the `x86` architecture. Only [long mode](http://en.wikipedia.org/wiki/Long_mode) interrupt gates and trap gates can be referenced in the `x86_64`. Like the `Global Descriptor Table`, the `Interrupt Descriptor table` is an array of 8-byte gates on `x86` and an array of 16-byte gates on `x86_64`. We can remember from the second part of the [Kernel booting process](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html), that `Global Descriptor Table` must contain `NULL` descriptor as its first element. Unlike the `Global Descriptor Table`, the `Interrupt Descriptor Table` may contain a gate; it is not mandatory. For example, you may remember that we have loaded the Interrupt Descriptor table with the `NULL` gates only in the earlier [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) while transitioning into [protected mode](http://en.wikipedia.org/wiki/Protected_mode): In the `x86` architecture. Only [long mode](http://en.wikipedia.org/wiki/Long_mode) interrupt gates and trap gates can be referenced in the `x86_64`. Like the `Global Descriptor Table`, the `Interrupt Descriptor table` is an array of 8-byte gates on `x86` and an array of 16-byte gates on `x86_64`. We can remember from the second part of the [Kernel booting process](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-2), that `Global Descriptor Table` must contain `NULL` descriptor as its first element. Unlike the `Global Descriptor Table`, the `Interrupt Descriptor Table` may contain a gate; it is not mandatory. For example, you may remember that we have loaded the Interrupt Descriptor table with the `NULL` gates only in the earlier [part](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-3) while transitioning into [protected mode](http://en.wikipedia.org/wiki/Protected_mode):
```C ```C
/* /*
@ -274,7 +274,7 @@ The `PAGE_SIZE` is `4096`-bytes and the `THREAD_SIZE_ORDER` depends on the `KASA
#endif #endif
``` ```
`KASan` is a runtime memory [debugger](http://lwn.net/Articles/618180/). Thus, the `THREAD_SIZE` will be `16384` bytes if `CONFIG_KASAN` is disabled or `32768` if this kernel configuration option is enabled. These stacks contain useful data as long as a thread is alive or in a zombie state. While the thread is in user-space, the kernel stack is empty except for the `thread_info` structure (details about this structure are available in the fourth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process) at the bottom of the stack. The active or zombie threads aren't the only threads with their own stack. There also exist specialized stacks that are associated with each available CPU. These stacks are active when the kernel is executing on that CPU. When the user-space is executing on the CPU, these stacks do not contain any useful information. Each CPU has a few special per-cpu stacks as well. The first is the `interrupt stack` used for the external hardware interrupts. Its size is determined as follows: `KASan` is a runtime memory [debugger](http://lwn.net/Articles/618180/). Thus, the `THREAD_SIZE` will be `16384` bytes if `CONFIG_KASAN` is disabled or `32768` if this kernel configuration option is enabled. These stacks contain useful data as long as a thread is alive or in a zombie state. While the thread is in user-space, the kernel stack is empty except for the `thread_info` structure (details about this structure are available in the fourth [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-4) of the Linux kernel initialization process) at the bottom of the stack. The active or zombie threads aren't the only threads with their own stack. There also exist specialized stacks that are associated with each available CPU. These stacks are active when the kernel is executing on that CPU. When the user-space is executing on the CPU, these stacks do not contain any useful information. Each CPU has a few special per-cpu stacks as well. The first is the `interrupt stack` used for the external hardware interrupts. Its size is determined as follows:
```C ```C
#define IRQ_STACK_ORDER (2 + KASAN_STACK_ORDER) #define IRQ_STACK_ORDER (2 + KASAN_STACK_ORDER)
@ -309,7 +309,7 @@ struct fixed_percpu_data {
The `irq_stack` struct contains a 16 kilobytes array. The `irq_stack` struct contains a 16 kilobytes array.
Also, you can see that the fixed\_percpu\_data contains two fields: Also, you can see that the fixed\_percpu\_data contains two fields:
* `gs_base` - The `gs` register always points to the bottom of the `fixed_percpu_data`. On the `x86_64`, the `gs` register is shared by per-cpu area and stack canary (more about `per-cpu` variables you can read in the special [part](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html)). All per-cpu symbols are zero-based and the `gs` points to the base of the per-cpu area. You already know that [segmented memory model](http://en.wikipedia.org/wiki/Memory_segmentation) is abolished in the long mode, but we can set the base address for the two segment registers - `fs` and `gs` with the [Model specific registers](http://en.wikipedia.org/wiki/Model-specific_register) and these registers can be still be used as address registers. If you remember the first [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of the Linux kernel initialization process, you can remember that we have set the `gs` register: * `gs_base` - The `gs` register always points to the bottom of the `fixed_percpu_data`. On the `x86_64`, the `gs` register is shared by per-cpu area and stack canary (more about `per-cpu` variables you can read in the special [part](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1)). All per-cpu symbols are zero-based and the `gs` points to the base of the per-cpu area. You already know that [segmented memory model](http://en.wikipedia.org/wiki/Memory_segmentation) is abolished in the long mode, but we can set the base address for the two segment registers - `fs` and `gs` with the [Model specific registers](http://en.wikipedia.org/wiki/Model-specific_register) and these registers can be still be used as address registers. If you remember the first [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-1) of the Linux kernel initialization process, you can remember that we have set the `gs` register:
```assembly ```assembly
movl $MSR_GS_BASE,%ecx movl $MSR_GS_BASE,%ecx
@ -493,4 +493,4 @@ Links
* [segmented memory model](http://en.wikipedia.org/wiki/Memory_segmentation) * [segmented memory model](http://en.wikipedia.org/wiki/Memory_segmentation)
* [Model specific registers](http://en.wikipedia.org/wiki/Model-specific_register) * [Model specific registers](http://en.wikipedia.org/wiki/Model-specific_register)
* [Stack canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries) * [Stack canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries)
* [Previous chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) * [Previous chapter](https://0xax.gitbook.io/linux-insides/summary/initialization)

View File

@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 10.
Last part Last part
------------------------------------------------------------------------------- -------------------------------------------------------------------------------
This is the tenth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) about interrupts and interrupt handling in the Linux kernel and in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-9.html) we saw a little about deferred interrupts and related concepts like `softirq`, `tasklet` and `workqeue`. In this part we will continue to dive into this theme and now it's time to look at real hardware driver. This is the tenth part of the [chapter](https://0xax.gitbook.io/linux-insides/summary/interrupts) about interrupts and interrupt handling in the Linux kernel and in the previous [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-9) we saw a little about deferred interrupts and related concepts like `softirq`, `tasklet` and `workqeue`. In this part we will continue to dive into this theme and now it's time to look at real hardware driver.
Let's consider serial driver of the [StrongARM** SA-110/21285 Evaluation Board](http://netwinder.osuosl.org/pub/netwinder/docs/intel/datashts/27813501.pdf) board for example and will look how this driver requests an [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) line, Let's consider serial driver of the [StrongARM** SA-110/21285 Evaluation Board](http://netwinder.osuosl.org/pub/netwinder/docs/intel/datashts/27813501.pdf) board for example and will look how this driver requests an [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) line,
what happens when an interrupt is triggered and etc. The source code of this driver is placed in the [drivers/tty/serial/21285.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/drivers/tty/serial/21285.c) source code file. Ok, we have source code, let's start. what happens when an interrupt is triggered and etc. The source code of this driver is placed in the [drivers/tty/serial/21285.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/drivers/tty/serial/21285.c) source code file. Ok, we have source code, let's start.
@ -261,7 +261,7 @@ if (!action)
return -ENOMEM; return -ENOMEM;
``` ```
More about `kzalloc` will be in the separate chapter about [memory management](https://0xax.gitbooks.io/linux-insides/content/MM/index.html) in the Linux kernel. As we allocated space for the `irqaction`, we start to initialize this structure with the values of interrupt handler, interrupt flags, device name, etc: More about `kzalloc` will be in the separate chapter about [memory management](https://0xax.gitbook.io/linux-insides/summary/mm) in the Linux kernel. As we allocated space for the `irqaction`, we start to initialize this structure with the values of interrupt handler, interrupt flags, device name, etc:
```C ```C
action->handler = handler; action->handler = handler;
@ -301,7 +301,7 @@ And fill the rest of the given interrupt descriptor fields in the end. So, our `
Prepare to handle an interrupt Prepare to handle an interrupt
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
In the previous paragraph we saw the requesting of the irq line for the given interrupt descriptor and registration of the `irqaction` structure for the given interrupt. We already know that when an interrupt event occurs, an interrupt controller notifies the processor about this event and processor tries to find appropriate interrupt gate for this interrupt. If you have read the eighth [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-8.html) of this chapter, you may remember the `native_init_IRQ` function. This function makes initialization of the local [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller). The following part of this function is the most interesting part for us right now: In the previous paragraph we saw the requesting of the irq line for the given interrupt descriptor and registration of the `irqaction` structure for the given interrupt. We already know that when an interrupt event occurs, an interrupt controller notifies the processor about this event and processor tries to find appropriate interrupt gate for this interrupt. If you have read the eighth [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-8) of this chapter, you may remember the `native_init_IRQ` function. This function makes initialization of the local [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller). The following part of this function is the most interesting part for us right now:
```C ```C
for_each_clear_bit_from(i, used_vectors, first_system_vector) { for_each_clear_bit_from(i, used_vectors, first_system_vector) {
@ -346,7 +346,7 @@ common_interrupt:
interrupt do_IRQ interrupt do_IRQ
``` ```
The macro `interrupt` defined in the same source code file and saves [general purpose](https://en.wikipedia.org/wiki/Processor_register) registers on the stack, change the userspace `gs` on the kernel with the `SWAPGS` assembler instruction if need, increase [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) - `irq_count` variable that shows that we are in interrupt and call the `do_IRQ` function. This function defined in the [arch/x86/kernel/irq.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/irq.c) source code file and handles our device interrupt. Let's look at this function. The `do_IRQ` function takes one parameter - `pt_regs` structure that stores values of the userspace registers: The macro `interrupt` defined in the same source code file and saves [general purpose](https://en.wikipedia.org/wiki/Processor_register) registers on the stack, change the userspace `gs` on the kernel with the `SWAPGS` assembler instruction if need, increase [per-cpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1) - `irq_count` variable that shows that we are in interrupt and call the `do_IRQ` function. This function defined in the [arch/x86/kernel/irq.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/irq.c) source code file and handles our device interrupt. Let's look at this function. The `do_IRQ` function takes one parameter - `pt_regs` structure that stores values of the userspace registers:
```C ```C
__visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs) __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
@ -413,7 +413,7 @@ We already know that when an `IRQ` finishes its work, deferred interrupts will b
Exit from interrupt Exit from interrupt
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
Ok, the interrupt handler finished its execution and now we must return from the interrupt. When the work of the `do_IRQ` function will be finsihed, we will return back to the assembler code in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry_entry_64.S) to the `ret_from_intr` label. First of all we disable interrupts with the `DISABLE_INTERRUPTS` macro that expands to the `cli` instruction and decreases value of the `irq_count` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variable. Remember, this variable had value - `1`, when we were in interrupt context: Ok, the interrupt handler finished its execution and now we must return from the interrupt. When the work of the `do_IRQ` function will be finsihed, we will return back to the assembler code in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry_entry_64.S) to the `ret_from_intr` label. First of all we disable interrupts with the `DISABLE_INTERRUPTS` macro that expands to the `cli` instruction and decreases value of the `irq_count` [per-cpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1) variable. Remember, this variable had value - `1`, when we were in interrupt context:
```assembly ```assembly
DISABLE_INTERRUPTS(CLBR_NONE) DISABLE_INTERRUPTS(CLBR_NONE)
@ -448,7 +448,7 @@ That's all.
Conclusion Conclusion
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
It is the end of the tenth part of the [Interrupts and Interrupt Handling](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) chapter and as you have read in the beginning of this part - it is the last part of this chapter. This chapter started from the explanation of the theory of interrupts and we have learned what is it interrupt and kinds of interrupts, then we saw exceptions and handling of this kind of interrupts, deferred interrupts and finally we looked on the hardware interrupts and the handling of theirs in this part. Of course, this part and even this chapter does not cover full aspects of interrupts and interrupt handling in the Linux kernel. It is not realistic to do this. At least for me. It was the big part, I don't know how about you, but it was really big for me. This theme is much bigger than this chapter and I am not sure that somewhere there is a book that covers it. We have missed many part and aspects of interrupts and interrupt handling, but I think it will be good point to dive in the kernel code related to the interrupts and interrupts handling. It is the end of the tenth part of the [Interrupts and Interrupt Handling](https://0xax.gitbook.io/linux-insides/summary/interrupts) chapter and as you have read in the beginning of this part - it is the last part of this chapter. This chapter started from the explanation of the theory of interrupts and we have learned what is it interrupt and kinds of interrupts, then we saw exceptions and handling of this kind of interrupts, deferred interrupts and finally we looked on the hardware interrupts and the handling of theirs in this part. Of course, this part and even this chapter does not cover full aspects of interrupts and interrupt handling in the Linux kernel. It is not realistic to do this. At least for me. It was the big part, I don't know how about you, but it was really big for me. This theme is much bigger than this chapter and I am not sure that somewhere there is a book that covers it. We have missed many part and aspects of interrupts and interrupt handling, but I think it will be good point to dive in the kernel code related to the interrupts and interrupts handling.
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
@ -464,13 +464,13 @@ Links
* [initcall](http://kernelnewbies.org/Documents/InitcallMechanism) * [initcall](http://kernelnewbies.org/Documents/InitcallMechanism)
* [uart](https://en.wikipedia.org/wiki/Universal_asynchronous_receiver/transmitter) * [uart](https://en.wikipedia.org/wiki/Universal_asynchronous_receiver/transmitter)
* [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture) * [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture)
* [memory management](https://0xax.gitbooks.io/linux-insides/content/MM/index.html) * [memory management](https://0xax.gitbook.io/linux-insides/summary/mm)
* [i2c](https://en.wikipedia.org/wiki/I%C2%B2C) * [i2c](https://en.wikipedia.org/wiki/I%C2%B2C)
* [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) * [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
* [GNU assembler](https://en.wikipedia.org/wiki/GNU_Assembler) * [GNU assembler](https://en.wikipedia.org/wiki/GNU_Assembler)
* [Processor register](https://en.wikipedia.org/wiki/Processor_register) * [Processor register](https://en.wikipedia.org/wiki/Processor_register)
* [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) * [per-cpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1)
* [pid](https://en.wikipedia.org/wiki/Process_identifier) * [pid](https://en.wikipedia.org/wiki/Process_identifier)
* [device tree](https://en.wikipedia.org/wiki/Device_tree) * [device tree](https://en.wikipedia.org/wiki/Device_tree)
* [system calls](https://en.wikipedia.org/wiki/System_call) * [system calls](https://en.wikipedia.org/wiki/System_call)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-9.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-9)

View File

@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 2.
Start to dive into interrupt and exceptions handling in the Linux kernel Start to dive into interrupt and exceptions handling in the Linux kernel
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
We saw some theory about interrupts and exception handling in the [introduction](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html) and as I mentioned in that part, we will now start to dive into interrupts and exceptions within the Linux kernel source code. We'll commence by initializing the basic components as we did in the other chapters. But, we will not see the Linux kernel source code from the very early [code lines](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S#L292), as this was presented in the example within the [Linux kernel booting process](https://0xax.gitbooks.io/linux-insides/content/Booting/index.html) chapter. In the beginning we will deal with the first sections of the Linux kernel source code, which are related to interrupts and exceptions. We saw some theory about interrupts and exception handling in the [introduction](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-1) and as I mentioned in that part, we will now start to dive into interrupts and exceptions within the Linux kernel source code. We'll commence by initializing the basic components as we did in the other chapters. But, we will not see the Linux kernel source code from the very early [code lines](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/header.S#L292), as this was presented in the example within the [Linux kernel booting process](https://0xax.gitbook.io/linux-insides/summary/booting) chapter. In the beginning we will deal with the first sections of the Linux kernel source code, which are related to interrupts and exceptions.
If you've read the previous parts, you can remember that the earliest place in the Linux kernel `x86_64` architecture-specific source code, which is related to the interrupt is located in the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/pm.c) source code file and represents the first setup of the [Interrupt Descriptor Table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table). It occurs right before the transition into the [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in the `go_to_protected_mode` function by calling `setup_idt`: If you've read the previous parts, you can remember that the earliest place in the Linux kernel `x86_64` architecture-specific source code, which is related to the interrupt is located in the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/pm.c) source code file and represents the first setup of the [Interrupt Descriptor Table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table). It occurs right before the transition into the [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in the `go_to_protected_mode` function by calling `setup_idt`:
@ -38,7 +38,7 @@ struct gdt_ptr {
Of course in our case the `gdt_ptr` does not represent the `GDTR` register, but `IDTR` since we set the `Interrupt Descriptor Table`. You will not find an `idt_ptr` structure, because if it had been in the Linux kernel source code, it would have been the same as a `gdt_ptr` but with a different name. It would make no sense to create two structures that only differ in their names. Note here that we do not fill the `Interrupt Descriptor Table` with entries, because it is too early to handle any interrupts or exceptions at this point. That's why we just fill the `IDT` with `NULL`. Of course in our case the `gdt_ptr` does not represent the `GDTR` register, but `IDTR` since we set the `Interrupt Descriptor Table`. You will not find an `idt_ptr` structure, because if it had been in the Linux kernel source code, it would have been the same as a `gdt_ptr` but with a different name. It would make no sense to create two structures that only differ in their names. Note here that we do not fill the `Interrupt Descriptor Table` with entries, because it is too early to handle any interrupts or exceptions at this point. That's why we just fill the `IDT` with `NULL`.
After the setup of the [Interrupt descriptor table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table), [Global Descriptor Table](http://en.wikipedia.org/wiki/GDT) and other stuff we jump into [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in the - [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/pmjump.S) file. You can read more about it in the [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html), which describes the transition to protected mode. After the setup of the [Interrupt descriptor table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table), [Global Descriptor Table](http://en.wikipedia.org/wiki/GDT) and other stuff we jump into [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in the - [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/pmjump.S) file. You can read more about it in the [part](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-3), which describes the transition to protected mode.
The entry to protected mode is located in the `boot_params.hdr.code32_start` and passed together with the `boot_params` to the `protected_mode_jump` function at the end of [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/pm.c): The entry to protected mode is located in the `boot_params.hdr.code32_start` and passed together with the `boot_params` to the `protected_mode_jump` function at the end of [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/pm.c):
@ -104,7 +104,7 @@ movl initial_gs+4(%rip),%edx
wrmsr wrmsr
``` ```
We already saw this code in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html). First of all pay attention on the last `wrmsr` instruction. This instruction writes data from the `edx:eax` registers to the [model specific register](http://en.wikipedia.org/wiki/Model-specific_register) specified by the `ecx` register. We can see that `ecx` contains `$MSR_GS_BASE` which is declared in the [arch/x86/include/uapi/asm/msr-index.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/uapi/asm/msr-index.h) and looks like: We already saw this code in the previous [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-1). First of all pay attention on the last `wrmsr` instruction. This instruction writes data from the `edx:eax` registers to the [model specific register](http://en.wikipedia.org/wiki/Model-specific_register) specified by the `ecx` register. We can see that `ecx` contains `$MSR_GS_BASE` which is declared in the [arch/x86/include/uapi/asm/msr-index.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/uapi/asm/msr-index.h) and looks like:
```C ```C
#define MSR_GS_BASE 0xc0000101 #define MSR_GS_BASE 0xc0000101
@ -178,7 +178,7 @@ movl initial_gs+4(%rip),%edx
wrmsr wrmsr
``` ```
Here we specified a model specific register with `MSR_GS_BASE`, put the 64-bit address of the `initial_gs` to the `edx:eax` pair and execute the `wrmsr` instruction for filling the `gs` register with the base address of the `init_per_cpu__irq_stack_union` which will be at the bottom of the interrupt stack. After this we will jump to the C code on the `x86_64_start_kernel` from the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head64.c). In the `x86_64_start_kernel` function we do the last preparations before we jump into the generic and architecture-independent kernel code and one of these preparations is filling the early `Interrupt Descriptor Table` with the interrupts handlers entries or `early_idt_handlers`. You can remember it, if you have read the part about the [Early interrupt and exception handling](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) and can remember following code: Here we specified a model specific register with `MSR_GS_BASE`, put the 64-bit address of the `initial_gs` to the `edx:eax` pair and execute the `wrmsr` instruction for filling the `gs` register with the base address of the `init_per_cpu__irq_stack_union` which will be at the bottom of the interrupt stack. After this we will jump to the C code on the `x86_64_start_kernel` from the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head64.c). In the `x86_64_start_kernel` function we do the last preparations before we jump into the generic and architecture-independent kernel code and one of these preparations is filling the early `Interrupt Descriptor Table` with the interrupts handlers entries or `early_idt_handlers`. You can remember it, if you have read the part about the [Early interrupt and exception handling](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-2) and can remember following code:
```C ```C
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
@ -219,12 +219,12 @@ ENTRY(early_idt_handler_array)
ENDPROC(early_idt_handler_common) ENDPROC(early_idt_handler_common)
``` ```
It fills `early_idt_handler_array` with the `.rept NUM_EXCEPTION_VECTORS` and contains entry of the `early_make_pgtable` interrupt handler (you can read more about its implementation in the part about [Early interrupt and exception handling](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html)). For now, we have reached the end of the x86_64 architecture-specific code and the next part is the generic kernel code. You probably already know, that we will return to the architecture-specific code in the `setup_arch` function and other places, but this is the end of the `x86_64` early code. It fills `early_idt_handler_array` with the `.rept NUM_EXCEPTION_VECTORS` and contains entry of the `early_make_pgtable` interrupt handler (you can read more about its implementation in the part about [Early interrupt and exception handling](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-2)). For now, we have reached the end of the x86_64 architecture-specific code and the next part is the generic kernel code. You probably already know, that we will return to the architecture-specific code in the `setup_arch` function and other places, but this is the end of the `x86_64` early code.
Setting stack canary for the interrupt stack Setting stack canary for the interrupt stack
------------------------------------------------------------------------------- -------------------------------------------------------------------------------
The next stop after the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head_64.S) is the biggest `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c). If you've read the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) about the Linux kernel initialization process, you must remember it. This function does all initialization stuff before kernel will launch first `init` process with the [pid](https://en.wikipedia.org/wiki/Process_identifier) - `1`. The first thing that is related to the interrupts and exceptions handling is the call of the `boot_init_stack_canary` function. The next stop after the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/head_64.S) is the biggest `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c). If you've read the previous [chapter](https://0xax.gitbook.io/linux-insides/summary/initialization) about the Linux kernel initialization process, you must remember it. This function does all initialization stuff before kernel will launch first `init` process with the [pid](https://en.wikipedia.org/wiki/Process_identifier) - `1`. The first thing that is related to the interrupts and exceptions handling is the call of the `boot_init_stack_canary` function.
This function sets the [canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries) value to protect interrupt stack overflow. We already saw a little some details about implementation of the `boot_init_stack_canary` in the previous part and now let's take a closer look on it. You can find implementation of this function in the [arch/x86/include/asm/stackprotector.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/stackprotector.h) and its depends on the `CONFIG_CC_STACKPROTECTOR` kernel configuration option. If this option is not set this function will not do anything: This function sets the [canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries) value to protect interrupt stack overflow. We already saw a little some details about implementation of the `boot_init_stack_canary` in the previous part and now let's take a closer look on it. You can find implementation of this function in the [arch/x86/include/asm/stackprotector.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/stackprotector.h) and its depends on the `CONFIG_CC_STACKPROTECTOR` kernel configuration option. If this option is not set this function will not do anything:
@ -240,7 +240,7 @@ static inline void boot_init_stack_canary(void)
#endif #endif
``` ```
If the `CONFIG_CC_STACKPROTECTOR` kernel configuration option is set, the `boot_init_stack_canary` function starts from the check stat `irq_stack_union` that represents [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) interrupt stack has offset equal to forty bytes from the `stack_canary` value: If the `CONFIG_CC_STACKPROTECTOR` kernel configuration option is set, the `boot_init_stack_canary` function starts from the check stat `irq_stack_union` that represents [per-cpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1) interrupt stack has offset equal to forty bytes from the `stack_canary` value:
```C ```C
#ifdef CONFIG_X86_64 #ifdef CONFIG_X86_64
@ -248,7 +248,7 @@ If the `CONFIG_CC_STACKPROTECTOR` kernel configuration option is set, the `boot_
#endif #endif
``` ```
As we can read in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html) the `irq_stack_union` represented by the following union: As we can read in the previous [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-1) the `irq_stack_union` represented by the following union:
```C ```C
union irq_stack_union { union irq_stack_union {
@ -261,7 +261,7 @@ union irq_stack_union {
}; };
``` ```
which defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/processor.h). We know that [union](http://en.wikipedia.org/wiki/Union_type) in the [C](http://en.wikipedia.org/wiki/C_%28programming_language%29) programming language is a data structure which stores only one field in a memory. We can see here that structure has first field - `gs_base` which is 40 bytes size and represents bottom of the `irq_stack`. So, after this our check with the `BUILD_BUG_ON` macro should end successfully. (you can read the first part about Linux kernel initialization [process](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) if you're interesting about the `BUILD_BUG_ON` macro). which defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/processor.h). We know that [union](http://en.wikipedia.org/wiki/Union_type) in the [C](http://en.wikipedia.org/wiki/C_%28programming_language%29) programming language is a data structure which stores only one field in a memory. We can see here that structure has first field - `gs_base` which is 40 bytes size and represents bottom of the `irq_stack`. So, after this our check with the `BUILD_BUG_ON` macro should end successfully. (you can read the first part about Linux kernel initialization [process](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-1) if you're interesting about the `BUILD_BUG_ON` macro).
After this we calculate new `canary` value based on the random number and [Time Stamp Counter](http://en.wikipedia.org/wiki/Time_Stamp_Counter): After this we calculate new `canary` value based on the random number and [Time Stamp Counter](http://en.wikipedia.org/wiki/Time_Stamp_Counter):
@ -397,7 +397,7 @@ WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
Early trap initialization during kernel initialization Early trap initialization during kernel initialization
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
The next functions after the `local_disable_irq` are `boot_cpu_init` and `page_address_init`, but they are not related to the interrupts and exceptions (more about this functions you can read in the chapter about Linux kernel [initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html)). The next is the `setup_arch` function. As you can remember this function located in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel.setup.c) source code file and makes initialization of many different architecture-dependent [stuff](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html). The first interrupts related function which we can see in the `setup_arch` is the - `early_trap_init` function. This function defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/traps.c) and fills `Interrupt Descriptor Table` with the couple of entries: The next functions after the `local_disable_irq` are `boot_cpu_init` and `page_address_init`, but they are not related to the interrupts and exceptions (more about this functions you can read in the chapter about Linux kernel [initialization process](https://0xax.gitbook.io/linux-insides/summary/initialization)). The next is the `setup_arch` function. As you can remember this function located in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel.setup.c) source code file and makes initialization of many different architecture-dependent [stuff](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-4). The first interrupts related function which we can see in the `setup_arch` is the - `early_trap_init` function. This function defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/traps.c) and fills `Interrupt Descriptor Table` with the couple of entries:
```C ```C
void __init early_trap_init(void) void __init early_trap_init(void)
@ -427,7 +427,7 @@ static inline void set_intr_gate_ist(int n, void *addr, unsigned ist)
} }
``` ```
First of all we can see the check that `n` which is [vector number](http://en.wikipedia.org/wiki/Interrupt_vector_table) of the interrupt is not greater than `0xff` or 255. We need to check it because we remember from the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html) that vector number of an interrupt must be between `0` and `255`. In the next step we can see the call of the `_set_gate` function that sets a given interrupt gate to the `IDT` table: First of all we can see the check that `n` which is [vector number](http://en.wikipedia.org/wiki/Interrupt_vector_table) of the interrupt is not greater than `0xff` or 255. We need to check it because we remember from the previous [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-1) that vector number of an interrupt must be between `0` and `255`. In the next step we can see the call of the `_set_gate` function that sets a given interrupt gate to the `IDT` table:
```C ```C
static inline void _set_gate(int gate, unsigned type, void *addr, static inline void _set_gate(int gate, unsigned type, void *addr,
@ -539,4 +539,4 @@ Links
* [vector number](http://en.wikipedia.org/wiki/Interrupt_vector_table) * [vector number](http://en.wikipedia.org/wiki/Interrupt_vector_table)
* [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks) * [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks)
* [Privilege level](http://en.wikipedia.org/wiki/Privilege_level) * [Privilege level](http://en.wikipedia.org/wiki/Privilege_level)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-1)

View File

@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 3.
Exception Handling Exception Handling
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) about an interrupts and an exceptions handling in the Linux kernel and in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) we stopped at the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blame/master/arch/x86/kernel/setup.c) source code file. This is the third part of the [chapter](https://0xax.gitbook.io/linux-insides/summary/interrupts) about an interrupts and an exceptions handling in the Linux kernel and in the previous [part](https://0xax.gitbook.io/linux-insides/summary/interrupts) we stopped at the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blame/master/arch/x86/kernel/setup.c) source code file.
We already know that this function executes initialization of architecture-specific stuff. In our case the `setup_arch` function does [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture related initializations. The `setup_arch` is big function, and in the previous part we stopped on the setting of the two exceptions handlers for the two following exceptions: We already know that this function executes initialization of architecture-specific stuff. In our case the `setup_arch` function does [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture related initializations. The `setup_arch` is big function, and in the previous part we stopped on the setting of the two exceptions handlers for the two following exceptions:
@ -516,7 +516,7 @@ Links
* [system call](http://en.wikipedia.org/wiki/System_call) * [system call](http://en.wikipedia.org/wiki/System_call)
* [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html) * [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html)
* [SIGTRAP](https://en.wikipedia.org/wiki/Unix_signal#SIGTRAP) * [SIGTRAP](https://en.wikipedia.org/wiki/Unix_signal#SIGTRAP)
* [Per-CPU variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) * [Per-CPU variables](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1)
* [kgdb](https://en.wikipedia.org/wiki/KGDB) * [kgdb](https://en.wikipedia.org/wiki/KGDB)
* [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) * [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/interrupts)

View File

@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 4.
Initialization of non-early interrupt gates Initialization of non-early interrupt gates
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is fourth part about an interrupts and exceptions handling in the Linux kernel and in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html) we saw first early `#DB` and `#BP` exceptions handlers from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We stopped on the right after the `early_trap_init` function that called in the `setup_arch` function which defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/setup.c). In this part we will continue to dive into an interrupts and exceptions handling in the Linux kernel for `x86_64` and continue to do it from the place where we left off in the last part. First thing which is related to the interrupts and exceptions handling is the setup of the `#PF` or [page fault](https://en.wikipedia.org/wiki/Page_fault) handler with the `early_trap_pf_init` function. Let's start from it. This is fourth part about an interrupts and exceptions handling in the Linux kernel and in the previous [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-3) we saw first early `#DB` and `#BP` exceptions handlers from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We stopped on the right after the `early_trap_init` function that called in the `setup_arch` function which defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/setup.c). In this part we will continue to dive into an interrupts and exceptions handling in the Linux kernel for `x86_64` and continue to do it from the place where we left off in the last part. First thing which is related to the interrupts and exceptions handling is the setup of the `#PF` or [page fault](https://en.wikipedia.org/wiki/Page_fault) handler with the `early_trap_pf_init` function. Let's start from it.
Early page fault handler Early page fault handler
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
@ -20,7 +20,7 @@ void __init early_trap_pf_init(void)
} }
``` ```
This macro defined in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/desc.h). We already saw macros like this in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html) - `set_system_intr_gate` and `set_intr_gate_ist`. This macro checks that given vector number is not greater than `255` (maximum vector number) and calls `_set_gate` function as `set_system_intr_gate` and `set_intr_gate_ist` did it: This macro defined in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/desc.h). We already saw macros like this in the previous [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-3) - `set_system_intr_gate` and `set_intr_gate_ist`. This macro checks that given vector number is not greater than `255` (maximum vector number) and calls `_set_gate` function as `set_system_intr_gate` and `set_intr_gate_ist` did it:
```C ```C
#define set_intr_gate(n, addr) \ #define set_intr_gate(n, addr) \
@ -64,7 +64,7 @@ When the `early_trap_pf_init` will be called, the `set_intr_gate` will be expand
trace_idtentry page_fault do_page_fault has_error_code=1 trace_idtentry page_fault do_page_fault has_error_code=1
``` ```
We saw in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html) how `#DB` and `#BP` handlers defined. They were defined with the `idtentry` macro, but here we can see `trace_idtentry`. This macro defined in the same source code file and depends on the `CONFIG_TRACING` kernel configuration option: We saw in the previous [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-3) how `#DB` and `#BP` handlers defined. They were defined with the `idtentry` macro, but here we can see `trace_idtentry`. This macro defined in the same source code file and depends on the `CONFIG_TRACING` kernel configuration option:
```assembly ```assembly
#ifdef CONFIG_TRACING #ifdef CONFIG_TRACING
@ -79,7 +79,7 @@ idtentry \sym \do_sym has_error_code=\has_error_code
#endif #endif
``` ```
We will not dive into exceptions [Tracing](https://en.wikipedia.org/wiki/Tracing_%28software%29) now. If `CONFIG_TRACING` is not set, we can see that `trace_idtentry` macro just expands to the normal `idtentry`. We already saw implementation of the `idtentry` macro in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html), so let's start from the `page_fault` exception handler. We will not dive into exceptions [Tracing](https://en.wikipedia.org/wiki/Tracing_%28software%29) now. If `CONFIG_TRACING` is not set, we can see that `trace_idtentry` macro just expands to the normal `idtentry`. We already saw implementation of the `idtentry` macro in the previous [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-3), so let's start from the `page_fault` exception handler.
As we can see in the `idtentry` definition, the handler of the `page_fault` is `do_page_fault` function which defined in the [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/mm/fault.c) and as all exceptions handlers it takes two arguments: As we can see in the `idtentry` definition, the handler of the `page_fault` is `do_page_fault` function which defined in the [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/mm/fault.c) and as all exceptions handlers it takes two arguments:
@ -197,7 +197,7 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
} }
``` ```
Here we can see `proc_root_readdir` function which will be called when the Linux [VFS](https://en.wikipedia.org/wiki/Virtual_file_system) needs to read the `root` directory contents. If condition marked with `unlikely`, compiler can put `false` code right after branching. Now let's back to the our address check. Comparison between the given address and the `0x00007ffffffff000` will give us to know, was page fault in the kernel mode or user mode. After this check we know it. After this `__do_page_fault` routine will try to understand the problem that provoked page fault exception and then will pass address to the appropriate routine. It can be `kmemcheck` fault, spurious fault, [kprobes](https://www.kernel.org/doc/Documentation/kprobes.txt) fault and etc. Will not dive into implementation details of the page fault exception handler in this part, because we need to know many different concepts which are provided by the Linux kernel, but will see it in the chapter about the [memory management](https://0xax.gitbooks.io/linux-insides/content/MM/index.html) in the Linux kernel. Here we can see `proc_root_readdir` function which will be called when the Linux [VFS](https://en.wikipedia.org/wiki/Virtual_file_system) needs to read the `root` directory contents. If condition marked with `unlikely`, compiler can put `false` code right after branching. Now let's back to the our address check. Comparison between the given address and the `0x00007ffffffff000` will give us to know, was page fault in the kernel mode or user mode. After this check we know it. After this `__do_page_fault` routine will try to understand the problem that provoked page fault exception and then will pass address to the appropriate routine. It can be `kmemcheck` fault, spurious fault, [kprobes](https://www.kernel.org/doc/Documentation/kprobes.txt) fault and etc. Will not dive into implementation details of the page fault exception handler in this part, because we need to know many different concepts which are provided by the Linux kernel, but will see it in the chapter about the [memory management](https://0xax.gitbook.io/linux-insides/summary/mm) in the Linux kernel.
Back to start_kernel Back to start_kernel
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
@ -214,7 +214,7 @@ There are many different function calls after the `early_trap_pf_init` in the `s
#endif #endif
``` ```
Note that it depends on the `CONFIG_EISA` kernel configuration parameter which represents `EISA` support. Here we use `early_ioremap` function to map `I/O` memory on the page tables. We use `readl` function to read first `4` bytes from the mapped region and if they are equal to `EISA` string we set `EISA_bus` to one. In the end we just unmap previously mapped region. More about `early_ioremap` you can read in the part which describes [Fix-Mapped Addresses and ioremap](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-2.html). Note that it depends on the `CONFIG_EISA` kernel configuration parameter which represents `EISA` support. Here we use `early_ioremap` function to map `I/O` memory on the page tables. We use `readl` function to read first `4` bytes from the mapped region and if they are equal to `EISA` string we set `EISA_bus` to one. In the end we just unmap previously mapped region. More about `early_ioremap` you can read in the part which describes [Fix-Mapped Addresses and ioremap](https://0xax.gitbook.io/linux-insides/summary/mm/linux-mm-2).
After this we start to fill the `Interrupt Descriptor Table` with the different interrupt gates. First of all we set `#DE` or `Divide Error` and `#NMI` or `Non-maskable Interrupt`: After this we start to fill the `Interrupt Descriptor Table` with the different interrupt gates. First of all we set `#DE` or `Divide Error` and `#NMI` or `Non-maskable Interrupt`:
@ -223,7 +223,7 @@ set_intr_gate(X86_TRAP_DE, divide_error);
set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK); set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
``` ```
We use `set_intr_gate` macro to set the interrupt gate for the `#DE` exception and `set_intr_gate_ist` for the `#NMI`. You can remember that we already used these macros when we have set the interrupts gates for the page fault handler, debug handler and etc, you can find explanation of it in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html). After this we setup exception gates for the following exceptions: We use `set_intr_gate` macro to set the interrupt gate for the `#DE` exception and `set_intr_gate_ist` for the `#NMI`. You can remember that we already used these macros when we have set the interrupts gates for the page fault handler, debug handler and etc, you can find explanation of it in the previous [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-3). After this we setup exception gates for the following exceptions:
```C ```C
set_system_intr_gate(X86_TRAP_OF, &overflow); set_system_intr_gate(X86_TRAP_OF, &overflow);
@ -300,7 +300,7 @@ In the next step we fill the `used_vectors` array which defined in the [arch/x86
DECLARE_BITMAP(used_vectors, NR_VECTORS); DECLARE_BITMAP(used_vectors, NR_VECTORS);
``` ```
of the first `32` interrupts (more about bitmaps in the Linux kernel you can read in the part which describes [cpumasks and bitmaps](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html)) of the first `32` interrupts (more about bitmaps in the Linux kernel you can read in the part which describes [cpumasks and bitmaps](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2))
```C ```C
for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++) for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
@ -329,7 +329,7 @@ __set_fixmap(FIX_RO_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO);
idt_descr.address = fix_to_virt(FIX_RO_IDT); idt_descr.address = fix_to_virt(FIX_RO_IDT);
``` ```
and write its address to the `idt_descr.address` (more about fix-mapped addresses you can read in the second part of the [Linux kernel memory management](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-2.html) chapter). After this we can see the call of the `cpu_init` function that defined in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/cpu/common.c). This function makes initialization of the all `per-cpu` state. In the beginning of the `cpu_init` we do the following things: First of all we wait while current cpu is initialized and than we call the `cr4_init_shadow` function which stores shadow copy of the `cr4` control register for the current cpu and load CPU microcode if need with the following function calls: and write its address to the `idt_descr.address` (more about fix-mapped addresses you can read in the second part of the [Linux kernel memory management](https://0xax.gitbook.io/linux-insides/summary/mm/linux-mm-2) chapter). After this we can see the call of the `cpu_init` function that defined in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/cpu/common.c). This function makes initialization of the all `per-cpu` state. In the beginning of the `cpu_init` we do the following things: First of all we wait while current cpu is initialized and than we call the `cr4_init_shadow` function which stores shadow copy of the `cr4` control register for the current cpu and load CPU microcode if need with the following function calls:
```C ```C
wait_for_master_cpu(cpu); wait_for_master_cpu(cpu);
@ -421,7 +421,7 @@ set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
#endif #endif
``` ```
Here we copy `idt_table` to the `nmi_dit_table` and setup exception handlers for the `#DB` or `Debug exception` and `#BR` or `Breakpoint exception`. You can remember that we already set these interrupt gates in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html), so why do we need to setup it again? We setup it again because when we initialized it before in the `early_trap_init` function, the `Task State Segment` was not ready yet, but now it is ready after the call of the `cpu_init` function. Here we copy `idt_table` to the `nmi_dit_table` and setup exception handlers for the `#DB` or `Debug exception` and `#BR` or `Breakpoint exception`. You can remember that we already set these interrupt gates in the previous [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-3), so why do we need to setup it again? We setup it again because when we initialized it before in the `early_trap_init` function, the `Task State Segment` was not ready yet, but now it is ready after the call of the `cpu_init` function.
That's all. Soon we will consider all handlers of these interrupts/exceptions. That's all. Soon we will consider all handlers of these interrupts/exceptions.
@ -448,8 +448,8 @@ Links
* [3DNow](https://en.wikipedia.org/?title=3DNow!) * [3DNow](https://en.wikipedia.org/?title=3DNow!)
* [CPU caches](https://en.wikipedia.org/wiki/CPU_cache) * [CPU caches](https://en.wikipedia.org/wiki/CPU_cache)
* [VFS](https://en.wikipedia.org/wiki/Virtual_file_system) * [VFS](https://en.wikipedia.org/wiki/Virtual_file_system)
* [Linux kernel memory management](https://0xax.gitbooks.io/linux-insides/content/MM/index.html) * [Linux kernel memory management](https://0xax.gitbook.io/linux-insides/summary/mm)
* [Fix-Mapped Addresses and ioremap](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-2.html) * [Fix-Mapped Addresses and ioremap](https://0xax.gitbook.io/linux-insides/summary/mm/linux-mm-2)
* [Extended Industry Standard Architecture](https://en.wikipedia.org/wiki/Extended_Industry_Standard_Architecture) * [Extended Industry Standard Architecture](https://en.wikipedia.org/wiki/Extended_Industry_Standard_Architecture)
* [INT isntruction](https://en.wikipedia.org/wiki/INT_%28x86_instruction%29) * [INT isntruction](https://en.wikipedia.org/wiki/INT_%28x86_instruction%29)
* [INTO](http://x86.renejeschke.de/html/file_module_x86_id_142.html) * [INTO](http://x86.renejeschke.de/html/file_module_x86_id_142.html)
@ -459,7 +459,7 @@ Links
* [x87 FPU](https://en.wikipedia.org/wiki/X86_instruction_listings#x87_floating-point_instructions) * [x87 FPU](https://en.wikipedia.org/wiki/X86_instruction_listings#x87_floating-point_instructions)
* [MCE exception](https://en.wikipedia.org/wiki/Machine-check_exception) * [MCE exception](https://en.wikipedia.org/wiki/Machine-check_exception)
* [SIMD](https://en.wikipedia.org/?title=SIMD) * [SIMD](https://en.wikipedia.org/?title=SIMD)
* [cpumasks and bitmaps](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) * [cpumasks and bitmaps](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2)
* [NX](https://en.wikipedia.org/wiki/NX_bit) * [NX](https://en.wikipedia.org/wiki/NX_bit)
* [Task State Segment](https://en.wikipedia.org/wiki/Task_state_segment) * [Task State Segment](https://en.wikipedia.org/wiki/Task_state_segment)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-3)

View File

@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 5.
Implementation of exception handlers Implementation of exception handlers
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the fifth part about an interrupts and exceptions handling in the Linux kernel and in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-4.html) we stopped on the setting of interrupt gates to the [Interrupt descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table). We did it in the `trap_init` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c) source code file. We saw only setting of these interrupt gates in the previous part and in the current part we will see implementation of the exception handlers for these gates. The preparation before an exception handler will be executed is in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/entry_64.S) assembly file and occurs in the [idtentry](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/entry_64.S#L820) macro that defines exceptions entry points: This is the fifth part about an interrupts and exceptions handling in the Linux kernel and in the previous [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-4) we stopped on the setting of interrupt gates to the [Interrupt descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table). We did it in the `trap_init` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c) source code file. We saw only setting of these interrupt gates in the previous part and in the current part we will see implementation of the exception handlers for these gates. The preparation before an exception handler will be executed is in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/entry_64.S) assembly file and occurs in the [idtentry](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/entry_64.S#L820) macro that defines exceptions entry points:
```assembly ```assembly
idtentry divide_error do_divide_error has_error_code=0 idtentry divide_error do_divide_error has_error_code=0
@ -62,7 +62,7 @@ native_irq_return_iret:
iretq iretq
``` ```
More about the `idtentry` macro you can read in the third part of the [https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html) chapter. Ok, now we saw the preparation before an exception handler will be executed and now time to look on the handlers. First of all let's look on the following handlers: More about the `idtentry` macro you can read in the third part of the [https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-3) chapter. Ok, now we saw the preparation before an exception handler will be executed and now time to look on the handlers. First of all let's look on the following handlers:
* divide_error * divide_error
* overflow * overflow
@ -211,7 +211,7 @@ static inline void conditional_sti(struct pt_regs *regs)
} }
``` ```
more about `local_irq_enable` macro you can read in the second [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-2.html) of this chapter. The next and last call in the `do_error_trap` is the `do_trap` function. First of all the `do_trap` function defined the `tsk` variable which has `task_struct` type and represents the current interrupted process. After the definition of the `tsk`, we can see the call of the `do_trap_no_signal` function: more about `local_irq_enable` macro you can read in the second [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-2) of this chapter. The next and last call in the `do_error_trap` is the `do_trap` function. First of all the `do_trap` function defined the `tsk` variable which has `task_struct` type and represents the current interrupted process. After the definition of the `tsk`, we can see the call of the `do_trap_no_signal` function:
```C ```C
struct task_struct *tsk = current; struct task_struct *tsk = current;
@ -463,7 +463,7 @@ That's all.
Conclusion Conclusion
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
It is the end of the fifth part of the [Interrupts and Interrupt Handling](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) chapter and we saw implementation of some interrupt handlers in this part. In the next part we will continue to dive into interrupt and exception handlers and will see handler for the [Non-Maskable Interrupts](https://en.wikipedia.org/wiki/Non-maskable_interrupt), handling of the math [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) and [SIMD](https://en.wikipedia.org/wiki/SIMD) coprocessor exceptions and many many more. It is the end of the fifth part of the [Interrupts and Interrupt Handling](https://0xax.gitbook.io/linux-insides/summary/interrupts) chapter and we saw implementation of some interrupt handlers in this part. In the next part we will continue to dive into interrupt and exception handlers and will see handler for the [Non-Maskable Interrupts](https://en.wikipedia.org/wiki/Non-maskable_interrupt), handling of the math [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) and [SIMD](https://en.wikipedia.org/wiki/SIMD) coprocessor exceptions and many many more.
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
@ -490,4 +490,4 @@ Links
* [x87 FPU](https://en.wikipedia.org/wiki/X87) * [x87 FPU](https://en.wikipedia.org/wiki/X87)
* [control register](https://en.wikipedia.org/wiki/Control_register) * [control register](https://en.wikipedia.org/wiki/Control_register)
* [MMX](https://en.wikipedia.org/wiki/MMX_%28instruction_set%29) * [MMX](https://en.wikipedia.org/wiki/MMX_%28instruction_set%29)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-4.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-4)

View File

@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 6.
Non-maskable interrupt handler Non-maskable interrupt handler
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
It is sixth part of the [Interrupts and Interrupt Handling in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) chapter and in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-5.html) we saw implementation of some exception handlers for the [General Protection Fault](https://en.wikipedia.org/wiki/General_protection_fault) exception, divide exception, invalid [opcode](https://en.wikipedia.org/wiki/Opcode) exceptions and etc. As I wrote in the previous part we will see implementations of the rest exceptions in this part. We will see implementation of the following handlers: It is sixth part of the [Interrupts and Interrupt Handling in the Linux kernel](https://0xax.gitbook.io/linux-insides/summary/interrupts) chapter and in the previous [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-5) we saw implementation of some exception handlers for the [General Protection Fault](https://en.wikipedia.org/wiki/General_protection_fault) exception, divide exception, invalid [opcode](https://en.wikipedia.org/wiki/Opcode) exceptions and etc. As I wrote in the previous part we will see implementations of the rest exceptions in this part. We will see implementation of the following handlers:
* [Non-Maskable](https://en.wikipedia.org/wiki/Non-maskable_interrupt) interrupt; * [Non-Maskable](https://en.wikipedia.org/wiki/Non-maskable_interrupt) interrupt;
* [BOUND](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm) Range Exceeded Exception; * [BOUND](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm) Range Exceeded Exception;
@ -21,13 +21,13 @@ A [Non-Maskable](https://en.wikipedia.org/wiki/Non-maskable_interrupt) interrupt
* External hardware asserts the non-maskable interrupt [pin](https://en.wikipedia.org/wiki/CPU_socket) on the CPU. * External hardware asserts the non-maskable interrupt [pin](https://en.wikipedia.org/wiki/CPU_socket) on the CPU.
* The processor receives a message on the system bus or the APIC serial bus with a delivery mode `NMI`. * The processor receives a message on the system bus or the APIC serial bus with a delivery mode `NMI`.
When the processor receives a `NMI` from one of these sources, the processor handles it immediately by calling the `NMI` handler pointed to by interrupt vector which has number `2` (see table in the first [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html)). We already filled the [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) with the [vector number](https://en.wikipedia.org/wiki/Interrupt_vector_table), address of the `nmi` interrupt handler and `NMI_STACK` [Interrupt Stack Table entry](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/kernel-stacks): When the processor receives a `NMI` from one of these sources, the processor handles it immediately by calling the `NMI` handler pointed to by interrupt vector which has number `2` (see table in the first [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-1)). We already filled the [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) with the [vector number](https://en.wikipedia.org/wiki/Interrupt_vector_table), address of the `nmi` interrupt handler and `NMI_STACK` [Interrupt Stack Table entry](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/kernel-stacks):
```C ```C
set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK); set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
``` ```
in the `trap_init` function which defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/traps.c) source code file. In the previous [parts](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) we saw that entry points of the all interrupt handlers are defined with the: in the `trap_init` function which defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/traps.c) source code file. In the previous [parts](https://0xax.gitbook.io/linux-insides/summary/interrupts) we saw that entry points of the all interrupt handlers are defined with the:
```assembly ```assembly
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
@ -75,7 +75,7 @@ The `__KERNEL_CS` macro defined in the [arch/x86/include/asm/segment.h](https://
#define __KERNEL_CS (GDT_ENTRY_KERNEL_CS*8) #define __KERNEL_CS (GDT_ENTRY_KERNEL_CS*8)
``` ```
more about `GDT` you can read in the second [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) of the Linux kernel booting process chapter. If `cs` is not kernel segment, it means that it is not nested `NMI` and we jump on the `first_nmi` label. Let's consider this case. First of all we put address of the current stack pointer to the `rdx` and pushes `1` to the stack in the `first_nmi` label: more about `GDT` you can read in the second [part](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-2) of the Linux kernel booting process chapter. If `cs` is not kernel segment, it means that it is not nested `NMI` and we jump on the `first_nmi` label. Let's consider this case. First of all we put address of the current stack pointer to the `rdx` and pushes `1` to the stack in the `first_nmi` label:
```assembly ```assembly
first_nmi: first_nmi:
@ -169,7 +169,7 @@ pushq $-1
ALLOC_PT_GPREGS_ON_STACK ALLOC_PT_GPREGS_ON_STACK
``` ```
We already saw implementation of the `ALLOC_PT_GREGS_ON_STACK` macro in the third part of the interrupts [chapter](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-3.html). This macro defined in the [arch/x86/entry/calling.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/calling.h) and yet another allocates `120` bytes on stack for the general purpose registers, from the `rdi` to the `r15`: We already saw implementation of the `ALLOC_PT_GREGS_ON_STACK` macro in the third part of the interrupts [chapter](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-3). This macro defined in the [arch/x86/entry/calling.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/calling.h) and yet another allocates `120` bytes on stack for the general purpose registers, from the `rdi` to the `r15`:
```assembly ```assembly
.macro ALLOC_PT_GPREGS_ON_STACK addskip=0 .macro ALLOC_PT_GPREGS_ON_STACK addskip=0
@ -260,7 +260,7 @@ Now let's look on the `do_nmi` exception handler. This function defined in the [
* address of the `pt_regs`; * address of the `pt_regs`;
* error code. * error code.
as all exception handlers. The `do_nmi` starts from the call of the `nmi_nesting_preprocess` function and ends with the call of the `nmi_nesting_postprocess`. The `nmi_nesting_preprocess` function checks that we likely do not work with the debug stack and if we on the debug stack set the `update_debug_stack` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variable to `1` and call the `debug_stack_set_zero` function from the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/cpu/common.c). This function increases the `debug_stack_use_ctr` per-cpu variable and loads new `Interrupt Descriptor Table`: as all exception handlers. The `do_nmi` starts from the call of the `nmi_nesting_preprocess` function and ends with the call of the `nmi_nesting_postprocess`. The `nmi_nesting_preprocess` function checks that we likely do not work with the debug stack and if we on the debug stack set the `update_debug_stack` [per-cpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1) variable to `1` and call the `debug_stack_set_zero` function from the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/cpu/common.c). This function increases the `debug_stack_use_ctr` per-cpu variable and loads new `Interrupt Descriptor Table`:
```C ```C
static inline void nmi_nesting_preprocess(struct pt_regs *regs) static inline void nmi_nesting_preprocess(struct pt_regs *regs)
@ -325,7 +325,7 @@ exception_exit(prev_state);
return; return;
``` ```
After we have got the state of the previous context, we add the exception to the `notify_die` chain and if it will return `NOTIFY_STOP` we return from the exception. More about notify chains and the `context tracking` functions you can read in the [previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-5.html). In the next step we enable interrupts if they were disabled with the `contidional_sti` function that checks `IF` flag and call the `local_irq_enable` depends on its value: After we have got the state of the previous context, we add the exception to the `notify_die` chain and if it will return `NOTIFY_STOP` we return from the exception. More about notify chains and the `context tracking` functions you can read in the [previous part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-5). In the next step we enable interrupts if they were disabled with the `contidional_sti` function that checks `IF` flag and call the `local_irq_enable` depends on its value:
```C ```C
conditional_sti(regs); conditional_sti(regs);
@ -446,7 +446,7 @@ That's all.
Conclusion Conclusion
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
It is the end of the sixth part of the [Interrupts and Interrupt Handling](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) chapter and we saw implementation of some exception handlers in this part, like `non-maskable` interrupt, [SIMD](https://en.wikipedia.org/wiki/SIMD) and [x87 FPU](https://en.wikipedia.org/wiki/X87) floating point exception. Finally we have finsihed with the `trap_init` function in this part and will go ahead in the next part. The next our point is the external interrupts and the `early_irq_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c). It is the end of the sixth part of the [Interrupts and Interrupt Handling](https://0xax.gitbook.io/linux-insides/summary/interrupts) chapter and we saw implementation of some exception handlers in this part, like `non-maskable` interrupt, [SIMD](https://en.wikipedia.org/wiki/SIMD) and [x87 FPU](https://en.wikipedia.org/wiki/X87) floating point exception. Finally we have finsihed with the `trap_init` function in this part and will go ahead in the next part. The next our point is the external interrupts and the `early_irq_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c).
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
@ -473,8 +473,8 @@ Links
* [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) * [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table)
* [stack frame](https://en.wikipedia.org/wiki/Call_stack) * [stack frame](https://en.wikipedia.org/wiki/Call_stack)
* [Model Specific regiser](https://en.wikipedia.org/wiki/Model-specific_register) * [Model Specific regiser](https://en.wikipedia.org/wiki/Model-specific_register)
* [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) * [percpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1)
* [RCU](https://en.wikipedia.org/wiki/Read-copy-update) * [RCU](https://en.wikipedia.org/wiki/Read-copy-update)
* [MPX](https://en.wikipedia.org/wiki/Intel_MPX) * [MPX](https://en.wikipedia.org/wiki/Intel_MPX)
* [x87 FPU](https://en.wikipedia.org/wiki/X87) * [x87 FPU](https://en.wikipedia.org/wiki/X87)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-5.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-5)

View File

@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 7.
Introduction to external interrupts Introduction to external interrupts
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the seventh part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) and in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-6.html) we have finished with the exceptions which are generated by the processor. In this part we will continue to dive to the interrupt handling and will start with the external hardware interrupt handling. As you can remember, in the previous part we have finished with the `trap_init` function from the [arch/x86/kernel/trap.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/traps.c) and the next step is the call of the `early_irq_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c). This is the seventh part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](https://0xax.gitbook.io/linux-insides/summary/interrupts) and in the previous [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-6) we have finished with the exceptions which are generated by the processor. In this part we will continue to dive to the interrupt handling and will start with the external hardware interrupt handling. As you can remember, in the previous part we have finished with the `trap_init` function from the [arch/x86/kernel/trap.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/traps.c) and the next step is the call of the `early_irq_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c).
Interrupts are signal that are sent across [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) or `Interrupt Request Line` by a hardware or software. External hardware interrupts allow devices like keyboard, mouse and etc, to indicate that it needs attention of the processor. Once the processor receives the `Interrupt Request`, it will temporary stop execution of the running program and invoke special routine which depends on an interrupt. We already know that this routine is called interrupt handler (or how we will call it `ISR` or `Interrupt Service Routine` from this part). The `ISR` or `Interrupt Handler Routine` can be found in Interrupt Vector table that is located at fixed address in the memory. After the interrupt is handled processor resumes the interrupted process. At the boot/initialization time, the Linux kernel identifies all devices in the machine, and appropriate interrupt handlers are loaded into the interrupt table. As we saw in the previous parts, most exceptions are handled simply by the sending a [Unix signal](https://en.wikipedia.org/wiki/Unix_signal) to the interrupted process. That's why kernel is can handle an exception quickly. Unfortunately we can not use this approach for the external hardware interrupts, because often they arrive after (and sometimes long after) the process to which they are related has been suspended. So it would make no sense to send a Unix signal to the current process. External interrupt handling depends on the type of an interrupt: Interrupts are signal that are sent across [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) or `Interrupt Request Line` by a hardware or software. External hardware interrupts allow devices like keyboard, mouse and etc, to indicate that it needs attention of the processor. Once the processor receives the `Interrupt Request`, it will temporary stop execution of the running program and invoke special routine which depends on an interrupt. We already know that this routine is called interrupt handler (or how we will call it `ISR` or `Interrupt Service Routine` from this part). The `ISR` or `Interrupt Handler Routine` can be found in Interrupt Vector table that is located at fixed address in the memory. After the interrupt is handled processor resumes the interrupted process. At the boot/initialization time, the Linux kernel identifies all devices in the machine, and appropriate interrupt handlers are loaded into the interrupt table. As we saw in the previous parts, most exceptions are handled simply by the sending a [Unix signal](https://en.wikipedia.org/wiki/Unix_signal) to the interrupted process. That's why kernel is can handle an exception quickly. Unfortunately we can not use this approach for the external hardware interrupts, because often they arrive after (and sometimes long after) the process to which they are related has been suspended. So it would make no sense to send a Unix signal to the current process. External interrupt handling depends on the type of an interrupt:
@ -95,7 +95,7 @@ More about this will be in the another chapter about the `NUMA`. The next step a
init_irq_default_affinity(); init_irq_default_affinity();
``` ```
function. The `init_irq_default_affinity` function defined in the same source code file and depends on the `CONFIG_SMP` kernel configuration option allocates a given [cpumask](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) structure (in our case it is the `irq_default_affinity`): function. The `init_irq_default_affinity` function defined in the same source code file and depends on the `CONFIG_SMP` kernel configuration option allocates a given [cpumask](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2) structure (in our case it is the `irq_default_affinity`):
```C ```C
#if defined(CONFIG_SMP) #if defined(CONFIG_SMP)
@ -207,7 +207,7 @@ for (i = 0; i < count; i++) {
We are going through the all interrupt descriptors and do the following things: We are going through the all interrupt descriptors and do the following things:
First of all we allocate [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variable for the `irq` kernel statistic with the `alloc_percpu` macro. This macro allocates one instance of an object of the given type for every processor on the system. You can access kernel statistic from the userspace via `/proc/stat`: First of all we allocate [percpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1) variable for the `irq` kernel statistic with the `alloc_percpu` macro. This macro allocates one instance of an object of the given type for every processor on the system. You can access kernel statistic from the userspace via `/proc/stat`:
``` ```
~$ cat /proc/stat ~$ cat /proc/stat
@ -221,7 +221,7 @@ cpu3 26648 8 6931 678891 414 0 244 0 0 0
... ...
``` ```
Where the sixth column is the servicing interrupts. After this we allocate [cpumask](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) for the given irq descriptor affinity and initialize the [spinlock](https://en.wikipedia.org/wiki/Spinlock) for the given interrupt descriptor. After this before the [critical section](https://en.wikipedia.org/wiki/Critical_section), the lock will be acquired with a call of the `raw_spin_lock` and unlocked with the call of the `raw_spin_unlock`. In the next step we call the `lockdep_set_class` macro which set the [Lock validator](https://lwn.net/Articles/185666/) `irq_desc_lock_class` class for the lock of the given interrupt descriptor. More about `lockdep`, `spinlock` and other synchronization primitives will be described in the separate chapter. Where the sixth column is the servicing interrupts. After this we allocate [cpumask](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2) for the given irq descriptor affinity and initialize the [spinlock](https://en.wikipedia.org/wiki/Spinlock) for the given interrupt descriptor. After this before the [critical section](https://en.wikipedia.org/wiki/Critical_section), the lock will be acquired with a call of the `raw_spin_lock` and unlocked with the call of the `raw_spin_unlock`. In the next step we call the `lockdep_set_class` macro which set the [Lock validator](https://lwn.net/Articles/185666/) `irq_desc_lock_class` class for the lock of the given interrupt descriptor. More about `lockdep`, `spinlock` and other synchronization primitives will be described in the separate chapter.
In the end of the loop we call the `desc_set_defaults` function from the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/irq/irqdesc.c). This function takes four parameters: In the end of the loop we call the `desc_set_defaults` function from the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/irq/irqdesc.c). This function takes four parameters:
@ -275,7 +275,7 @@ desc->owner = owner;
... ...
``` ```
After this we go through the all [possible](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) processor with the [for_each_possible_cpu](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/cpumask.h#L714) helper and set the `kstat_irqs` to zero for the given interrupt descriptor: After this we go through the all [possible](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2) processor with the [for_each_possible_cpu](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/cpumask.h#L714) helper and set the `kstat_irqs` to zero for the given interrupt descriptor:
```C ```C
for_each_possible_cpu(cpu) for_each_possible_cpu(cpu)
@ -367,7 +367,7 @@ if (nr_irqs > (NR_VECTORS * nr_cpu_ids))
nr = (gsi_top + nr_legacy_irqs()) + 8 * nr_cpu_ids; nr = (gsi_top + nr_legacy_irqs()) + 8 * nr_cpu_ids;
``` ```
Take a look on the `gsi_top` variable. Each `APIC` is identified with its own `ID` and with the offset where its `IRQ` starts. It is called `GSI` base or `Global System Interrupt` base. So the `gsi_top` represents it. We get the `Global System Interrupt` base from the [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification) table (you can remember that we have parsed this table in the sixth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the Linux Kernel initialization process chapter). Take a look on the `gsi_top` variable. Each `APIC` is identified with its own `ID` and with the offset where its `IRQ` starts. It is called `GSI` base or `Global System Interrupt` base. So the `gsi_top` represents it. We get the `Global System Interrupt` base from the [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification) table (you can remember that we have parsed this table in the sixth [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-6) of the Linux Kernel initialization process chapter).
After this we update the `nr` depends on the value of the `gsi_top`: After this we update the `nr` depends on the value of the `gsi_top`:
@ -413,7 +413,7 @@ if (WARN_ON(initcnt > IRQ_BITMAP_BITS))
initcnt = IRQ_BITMAP_BITS; initcnt = IRQ_BITMAP_BITS;
``` ```
where `IRQ_BITMAP_BITS` is equal to the `NR_IRQS` if the `CONFIG_SPARSE_IRQ` is not set and `NR_IRQS + 8196` in other way. In the next step we are going over all interrupt descriptors which need to be allocated in the loop and allocate space for the descriptor and insert to the `irq_desc_tree` [radix tree](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-2.html): where `IRQ_BITMAP_BITS` is equal to the `NR_IRQS` if the `CONFIG_SPARSE_IRQ` is not set and `NR_IRQS + 8196` in other way. In the next step we are going over all interrupt descriptors which need to be allocated in the loop and allocate space for the descriptor and insert to the `irq_desc_tree` [radix tree](https://0xax.gitbook.io/linux-insides/summary/datastructures/linux-datastructures-2):
```C ```C
for (i = 0; i < initcnt; i++) { for (i = 0; i < initcnt; i++) {
@ -434,7 +434,7 @@ That's all.
Conclusion Conclusion
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
It is the end of the seventh part of the [Interrupts and Interrupt Handling](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) chapter and we started to dive into external hardware interrupts in this part. We saw early initialization of the `irq_desc` structure which represents description of an external interrupt and contains information about it like list of irq actions, information about interrupt handler, interrupt's owner, count of the unhandled interrupt and etc. In the next part we will continue to research external interrupts. It is the end of the seventh part of the [Interrupts and Interrupt Handling](https://0xax.gitbook.io/linux-insides/summary/interrupts) chapter and we started to dive into external hardware interrupts in this part. We saw early initialization of the `irq_desc` structure which represents description of an external interrupt and contains information about it like list of irq actions, information about interrupt handler, interrupt's owner, count of the unhandled interrupt and etc. In the next part we will continue to research external interrupts.
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX). If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
@ -446,8 +446,8 @@ Links
* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) * [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [numa](https://en.wikipedia.org/wiki/Non-uniform_memory_access) * [numa](https://en.wikipedia.org/wiki/Non-uniform_memory_access)
* [Enum type](https://en.wikipedia.org/wiki/Enumerated_type) * [Enum type](https://en.wikipedia.org/wiki/Enumerated_type)
* [cpumask](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) * [cpumask](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2)
* [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) * [percpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1)
* [spinlock](https://en.wikipedia.org/wiki/Spinlock) * [spinlock](https://en.wikipedia.org/wiki/Spinlock)
* [critical section](https://en.wikipedia.org/wiki/Critical_section) * [critical section](https://en.wikipedia.org/wiki/Critical_section)
* [Lock validator](https://lwn.net/Articles/185666/) * [Lock validator](https://lwn.net/Articles/185666/)
@ -457,5 +457,5 @@ Links
* [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259) * [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259)
* [PIC](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller) * [PIC](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller)
* [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification) * [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification)
* [radix tree](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-2.html) * [radix tree](https://0xax.gitbook.io/linux-insides/summary/datastructures/linux-datastructures-2)
* [dmesg](https://en.wikipedia.org/wiki/Dmesg) * [dmesg](https://en.wikipedia.org/wiki/Dmesg)

View File

@ -4,9 +4,9 @@ Interrupts and Interrupt Handling. Part 8.
Non-early initialization of the IRQs Non-early initialization of the IRQs
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the eighth part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) and in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-7.html) we started to dive into the external hardware [interrupts](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29). We looked on the implementation of the `early_irq_init` function from the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/irq/irqdesc.c) source code file and saw the initialization of the `irq_desc` structure in this function. Remind that `irq_desc` structure (defined in the [include/linux/irqdesc.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/irqdesc.h#L46) is the foundation of interrupt management code in the Linux kernel and represents an interrupt descriptor. In this part we will continue to dive into the initialization stuff which is related to the external hardware interrupts. This is the eighth part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](https://0xax.gitbook.io/linux-insides/summary/interrupts) and in the previous [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-7) we started to dive into the external hardware [interrupts](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29). We looked on the implementation of the `early_irq_init` function from the [kernel/irq/irqdesc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/irq/irqdesc.c) source code file and saw the initialization of the `irq_desc` structure in this function. Remind that `irq_desc` structure (defined in the [include/linux/irqdesc.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/irqdesc.h#L46) is the foundation of interrupt management code in the Linux kernel and represents an interrupt descriptor. In this part we will continue to dive into the initialization stuff which is related to the external hardware interrupts.
Right after the call of the `early_irq_init` function in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) we can see the call of the `init_IRQ` function. This function is architecture-specific and defined in the [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/irqinit.c). The `init_IRQ` function makes initialization of the `vector_irq` [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variable that defined in the same [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/irqinit.c) source code file: Right after the call of the `early_irq_init` function in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) we can see the call of the `init_IRQ` function. This function is architecture-specific and defined in the [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/irqinit.c). The `init_IRQ` function makes initialization of the `vector_irq` [percpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1) variable that defined in the same [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/irqinit.c) source code file:
```C ```C
... ...
@ -22,13 +22,13 @@ and represents `percpu` array of the interrupt vector numbers. The `vector_irq_t
typedef int vector_irq_t[NR_VECTORS]; typedef int vector_irq_t[NR_VECTORS];
``` ```
where `NR_VECTORS` is count of the vector number and as you can remember from the first [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html) of this chapter it is `256` for the [x86_64](https://en.wikipedia.org/wiki/X86-64): where `NR_VECTORS` is count of the vector number and as you can remember from the first [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-1) of this chapter it is `256` for the [x86_64](https://en.wikipedia.org/wiki/X86-64):
```C ```C
#define NR_VECTORS 256 #define NR_VECTORS 256
``` ```
So, in the start of the `init_IRQ` function we fill the `vector_irq` [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) array with the vector number of the `legacy` interrupts: So, in the start of the `init_IRQ` function we fill the `vector_irq` [percpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1) array with the vector number of the `legacy` interrupts:
```C ```C
void __init init_IRQ(void) void __init init_IRQ(void)
@ -105,7 +105,7 @@ In the loop we are accessing the `vecto_irq` per-cpu array with the `per_cpu` ma
#define IRQ0_VECTOR ((FIRST_EXTERNAL_VECTOR + 16) & ~15) #define IRQ0_VECTOR ((FIRST_EXTERNAL_VECTOR + 16) & ~15)
``` ```
Why is `0x30` here? You can remember from the first [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html) of this chapter that first 32 vector numbers from `0` to `31` are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. Vector numbers from `0x30` to `0x3f` are reserved for the [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture). So, it means that we fill the `vector_irq` from the `IRQ0_VECTOR` which is equal to the `32` to the `IRQ0_VECTOR + 16` (before the `0x30`). Why is `0x30` here? You can remember from the first [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-1) of this chapter that first 32 vector numbers from `0` to `31` are reserved by the processor and used for the processing of architecture-defined exceptions and interrupts. Vector numbers from `0x30` to `0x3f` are reserved for the [ISA](https://en.wikipedia.org/wiki/Industry_Standard_Architecture). So, it means that we fill the `vector_irq` from the `IRQ0_VECTOR` which is equal to the `32` to the `IRQ0_VECTOR + 16` (before the `0x30`).
In the end of the `init_IRQ` function we can see the call of the following function: In the end of the `init_IRQ` function we can see the call of the following function:
@ -113,7 +113,7 @@ In the end of the `init_IRQ` function we can see the call of the following funct
x86_init.irqs.intr_init(); x86_init.irqs.intr_init();
``` ```
from the [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/x86_init.c) source code file. If you have read [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) about the Linux kernel initialization process, you can remember the `x86_init` structure. This structure contains a couple of files which are points to the function related to the platform setup (`x86_64` in our case), for example `resources` - related with the memory resources, `mpparse` - related with the parsing of the [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification) table and etc.). As we can see the `x86_init` also contains the `irqs` field which contains three following fields: from the [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/x86_init.c) source code file. If you have read [chapter](https://0xax.gitbook.io/linux-insides/summary/initialization) about the Linux kernel initialization process, you can remember the `x86_init` structure. This structure contains a couple of files which are points to the function related to the platform setup (`x86_64` in our case), for example `resources` - related with the memory resources, `mpparse` - related with the parsing of the [MultiProcessor Configuration Table](https://en.wikipedia.org/wiki/MultiProcessor_Specification) table and etc.). As we can see the `x86_init` also contains the `irqs` field which contains three following fields:
```C ```C
struct x86_init_ops x86_init __initdata struct x86_init_ops x86_init __initdata
@ -179,7 +179,7 @@ After this depends on the `CONFIG_X86_64` and `CONFIG_X86_LOCAL_APIC` kernel con
#endif #endif
``` ```
This function makes initialization of the [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) of `bootstrap processor` (or processor which starts first). It starts from the check that we found [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) config (read more about it in the sixth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the Linux kernel initialization process chapter) and the processor has `APIC`: This function makes initialization of the [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) of `bootstrap processor` (or processor which starts first). It starts from the check that we found [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) config (read more about it in the sixth [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-6) of the Linux kernel initialization process chapter) and the processor has `APIC`:
```C ```C
if (smp_found_config || !cpu_has_apic) if (smp_found_config || !cpu_has_apic)
@ -241,7 +241,7 @@ do { \
} while (0) } while (0)
``` ```
As we can see, first of all it expands to the call of the `alloc_system_vector` function that checks the given vector number in the `used_vectors` bitmap (read previous [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-7.html) about it) and if it is not set in the `used_vectors` bitmap we set it. After this we test that the `first_system_vector` is greater than given interrupt vector number and if it is greater we assign it: As we can see, first of all it expands to the call of the `alloc_system_vector` function that checks the given vector number in the `used_vectors` bitmap (read previous [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-7) about it) and if it is not set in the `used_vectors` bitmap we set it. After this we test that the `first_system_vector` is greater than given interrupt vector number and if it is greater we assign it:
```C ```C
if (!test_bit(vector, used_vectors)) { if (!test_bit(vector, used_vectors)) {
@ -399,7 +399,7 @@ for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
set_bit(i, used_vectors); set_bit(i, used_vectors);
``` ```
You can remember how we did it in the sixth [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-6.html) of this chapter. You can remember how we did it in the sixth [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-6) of this chapter.
In the end of the `native_init_IRQ` function we can see the following check: In the end of the `native_init_IRQ` function we can see the following check:
@ -509,7 +509,7 @@ That's all.
Conclusion Conclusion
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
It is the end of the eighth part of the [Interrupts and Interrupt Handling](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) chapter and we continued to dive into external hardware interrupts in this part. In the previous part we started to do it and saw early initialization of the `IRQs`. In this part we already saw non-early interrupts initialization in the `init_IRQ` function. We saw initialization of the `vector_irq` per-cpu array which is store vector numbers of the interrupts and will be used during interrupt handling and initialization of other stuff which is related to the external hardware interrupts. It is the end of the eighth part of the [Interrupts and Interrupt Handling](https://0xax.gitbook.io/linux-insides/summary/interrupts) chapter and we continued to dive into external hardware interrupts in this part. In the previous part we started to do it and saw early initialization of the `IRQs`. In this part we already saw non-early interrupts initialization in the `init_IRQ` function. We saw initialization of the `vector_irq` per-cpu array which is store vector numbers of the interrupts and will be used during interrupt handling and initialization of other stuff which is related to the external hardware interrupts.
In the next part we will continue to learn interrupts handling related stuff and will see initialization of the `softirqs`. In the next part we will continue to learn interrupts handling related stuff and will see initialization of the `softirqs`.
@ -521,7 +521,7 @@ Links
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) * [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) * [percpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1)
* [x86_64](https://en.wikipedia.org/wiki/X86-64) * [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259) * [Intel 8259](https://en.wikipedia.org/wiki/Intel_8259)
* [Programmable Interrupt Controller](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller) * [Programmable Interrupt Controller](https://en.wikipedia.org/wiki/Programmable_Interrupt_Controller)
@ -539,4 +539,4 @@ Links
* [Open Firmware](https://en.wikipedia.org/wiki/Open_Firmware) * [Open Firmware](https://en.wikipedia.org/wiki/Open_Firmware)
* [devicetree](https://en.wikipedia.org/wiki/Device_tree) * [devicetree](https://en.wikipedia.org/wiki/Device_tree)
* [RTC](https://en.wikipedia.org/wiki/Real-time_clock) * [RTC](https://en.wikipedia.org/wiki/Real-time_clock)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-7.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-7)

View File

@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 9.
Introduction to deferred interrupts (Softirq, Tasklets and Workqueues) Introduction to deferred interrupts (Softirq, Tasklets and Workqueues)
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
It is the nine part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) and in the previous [Previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-8.html) we saw implementation of the `init_IRQ` from that defined in the [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/irqinit.c) source code file. So, we will continue to dive into the initialization stuff which is related to the external hardware interrupts in this part. It is the nine part of the Interrupts and Interrupt Handling in the Linux kernel [chapter](https://0xax.gitbook.io/linux-insides/summary/interrupts) and in the previous [Previous part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-8) we saw implementation of the `init_IRQ` from that defined in the [arch/x86/kernel/irqinit.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/irqinit.c) source code file. So, we will continue to dive into the initialization stuff which is related to the external hardware interrupts in this part.
Interrupts may have different important characteristics and there are two among them: Interrupts may have different important characteristics and there are two among them:
@ -145,7 +145,7 @@ The `raise_softirq_irqoff` function marks the softirq as deffered by setting the
__raise_softirq_irqoff(nr); __raise_softirq_irqoff(nr);
``` ```
macro. After this, it checks the result of the `in_interrupt` that returns `irq_count` value. We already saw the `irq_count` in the first [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-1.html) of this chapter and it is used to check if a CPU is already on an interrupt stack or not. We just exit from the `raise_softirq_irqoff`, restore `IF` flag and enable interrupts on the local processor, if we are in the interrupt context, otherwise we call the `wakeup_softirqd`: macro. After this, it checks the result of the `in_interrupt` that returns `irq_count` value. We already saw the `irq_count` in the first [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-1) of this chapter and it is used to check if a CPU is already on an interrupt stack or not. We just exit from the `raise_softirq_irqoff`, restore `IF` flag and enable interrupts on the local processor, if we are in the interrupt context, otherwise we call the `wakeup_softirqd`:
```C ```C
if (!in_interrupt()) if (!in_interrupt())
@ -231,7 +231,7 @@ void __init softirq_init(void)
} }
``` ```
We can see definition of the integer `cpu` variable at the beginning of the `softirq_init` function. Next we will use it as parameter for the `for_each_possible_cpu` macro that goes through the all possible processors in the system. If the `possible processor` is the new terminology for you, you can read more about it the [CPU masks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) chapter. In short words, `possible cpus` is the set of processors that can be plugged in anytime during the life of that system boot. All `possible processors` stored in the `cpu_possible_bits` bitmap, you can find its definition in the [kernel/cpu.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/cpu.c): We can see definition of the integer `cpu` variable at the beginning of the `softirq_init` function. Next we will use it as parameter for the `for_each_possible_cpu` macro that goes through the all possible processors in the system. If the `possible processor` is the new terminology for you, you can read more about it the [CPU masks](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2) chapter. In short words, `possible cpus` is the set of processors that can be plugged in anytime during the life of that system boot. All `possible processors` stored in the `cpu_possible_bits` bitmap, you can find its definition in the [kernel/cpu.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/cpu.c):
```C ```C
static DECLARE_BITMAP(cpu_possible_bits, CONFIG_NR_CPUS) __read_mostly; static DECLARE_BITMAP(cpu_possible_bits, CONFIG_NR_CPUS) __read_mostly;
@ -241,7 +241,7 @@ static DECLARE_BITMAP(cpu_possible_bits, CONFIG_NR_CPUS) __read_mostly;
const struct cpumask *const cpu_possible_mask = to_cpumask(cpu_possible_bits); const struct cpumask *const cpu_possible_mask = to_cpumask(cpu_possible_bits);
``` ```
Ok, we defined the integer `cpu` variable and go through the all possible processors with the `for_each_possible_cpu` macro and makes initialization of the two following [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variables: Ok, we defined the integer `cpu` variable and go through the all possible processors with the `for_each_possible_cpu` macro and makes initialization of the two following [per-cpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1) variables:
* `tasklet_vec`; * `tasklet_vec`;
* `tasklet_hi_vec`; * `tasklet_hi_vec`;
@ -368,7 +368,7 @@ static void tasklet_action(struct softirq_action *a)
} }
``` ```
In the beginning of the `tasklet_action` function, we disable interrupts for the local processor with the help of the `local_irq_disable` macro (you can read about this macro in the second [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-2.html) of this chapter). In the next step, we take a head of the list that contains tasklets with normal priority and set this per-cpu list to `NULL` because all tasklets must be executed in a generally way. After this we enable interrupts for the local processor and go through the list of tasklets in the loop. In every iteration of the loop we call the `tasklet_trylock` function for the given tasklet that updates state of the given tasklet on `TASKLET_STATE_RUN`: In the beginning of the `tasklet_action` function, we disable interrupts for the local processor with the help of the `local_irq_disable` macro (you can read about this macro in the second [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-2) of this chapter). In the next step, we take a head of the list that contains tasklets with normal priority and set this per-cpu list to `NULL` because all tasklets must be executed in a generally way. After this we enable interrupts for the local processor and go through the list of tasklets in the loop. In every iteration of the loop we call the `tasklet_trylock` function for the given tasklet that updates state of the given tasklet on `TASKLET_STATE_RUN`:
```C ```C
static inline int tasklet_trylock(struct tasklet_struct *t) static inline int tasklet_trylock(struct tasklet_struct *t)
@ -481,7 +481,7 @@ bool queue_work_on(int cpu, struct workqueue_struct *wq,
} }
``` ```
The `__queue_work` function gets the `work pool`. Yes, the `work pool` not `workqueue`. Actually, all `works` are not placed in the `workqueue`, but to the `work pool` that is represented by the `worker_pool` structure in the Linux kernel. As you can see above, the `workqueue_struct` structure has the `pwqs` field which is list of `worker_pools`. When we create a `workqueue`, it stands out for each processor the `pool_workqueue`. Each `pool_workqueue` associated with `worker_pool`, which is allocated on the same processor and corresponds to the type of priority queue. Through them `workqueue` interacts with `worker_pool`. So in the `__queue_work` function we set the cpu to the current processor with the `raw_smp_processor_id` (you can find information about this macro in the fourth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process chapter), getting the `pool_workqueue` for the given `workqueue_struct` and insert the given `work` to the given `workqueue`: The `__queue_work` function gets the `work pool`. Yes, the `work pool` not `workqueue`. Actually, all `works` are not placed in the `workqueue`, but to the `work pool` that is represented by the `worker_pool` structure in the Linux kernel. As you can see above, the `workqueue_struct` structure has the `pwqs` field which is list of `worker_pools`. When we create a `workqueue`, it stands out for each processor the `pool_workqueue`. Each `pool_workqueue` associated with `worker_pool`, which is allocated on the same processor and corresponds to the type of priority queue. Through them `workqueue` interacts with `worker_pool`. So in the `__queue_work` function we set the cpu to the current processor with the `raw_smp_processor_id` (you can find information about this macro in the fourth [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-4) of the Linux kernel initialization process chapter), getting the `pool_workqueue` for the given `workqueue_struct` and insert the given `work` to the given `workqueue`:
```C ```C
static void __queue_work(int cpu, struct workqueue_struct *wq, static void __queue_work(int cpu, struct workqueue_struct *wq,
@ -510,7 +510,7 @@ That's all.
Conclusion Conclusion
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
It is the end of the ninth part of the [Interrupts and Interrupt Handling](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) chapter and we continued to dive into external hardware interrupts in this part. In the previous part we saw initialization of the `IRQs` and main `irq_desc` structure. In this part we saw three concepts: the `softirq`, `tasklet` and `workqueue` that are used for the deferred functions. It is the end of the ninth part of the [Interrupts and Interrupt Handling](https://0xax.gitbook.io/linux-insides/summary/interrupts) chapter and we continued to dive into external hardware interrupts in this part. In the previous part we saw initialization of the `IRQs` and main `irq_desc` structure. In this part we saw three concepts: the `softirq`, `tasklet` and `workqueue` that are used for the deferred functions.
The next part will be last part of the `Interrupts and Interrupt Handling` chapter and we will look on the real hardware driver and will try to learn how it works with the interrupts subsystem. The next part will be last part of the `Interrupts and Interrupt Handling` chapter and we will look on the real hardware driver and will try to learn how it works with the interrupts subsystem.
@ -524,7 +524,7 @@ Links
* [initcall](https://kernelnewbies.org/Documents/InitcallMechanism) * [initcall](https://kernelnewbies.org/Documents/InitcallMechanism)
* [IF](https://en.wikipedia.org/wiki/Interrupt_flag) * [IF](https://en.wikipedia.org/wiki/Interrupt_flag)
* [eflags](https://en.wikipedia.org/wiki/FLAGS_register) * [eflags](https://en.wikipedia.org/wiki/FLAGS_register)
* [CPU masks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) * [CPU masks](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2)
* [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) * [per-cpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1)
* [Workqueue](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/workqueue.txt) * [Workqueue](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/workqueue.txt)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-8.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-8)

View File

@ -4,13 +4,13 @@ Linux kernel memory management Part 1.
Introduction Introduction
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
Memory management is one of the most complex (and I think that it is the most complex) part of the operating system kernel. In the [last preparations before the kernel entry point](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) part we stopped right before call of the `start_kernel` function. This function initializes all the kernel features (including architecture-dependent features) before the kernel runs the first `init` process. You may remember as we built early page tables, identity page tables and fixmap page tables in the boot time. No complicated memory management is working yet. When the `start_kernel` function is called we will see the transition to more complex data structures and techniques for memory management. For a good understanding of the initialization process in the linux kernel we need to have a clear understanding of these techniques. This chapter will provide an overview of the different parts of the linux kernel memory management framework and its API, starting from the `memblock`. Memory management is one of the most complex (and I think that it is the most complex) part of the operating system kernel. In the [last preparations before the kernel entry point](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-3) part we stopped right before call of the `start_kernel` function. This function initializes all the kernel features (including architecture-dependent features) before the kernel runs the first `init` process. You may remember as we built early page tables, identity page tables and fixmap page tables in the boot time. No complicated memory management is working yet. When the `start_kernel` function is called we will see the transition to more complex data structures and techniques for memory management. For a good understanding of the initialization process in the linux kernel we need to have a clear understanding of these techniques. This chapter will provide an overview of the different parts of the linux kernel memory management framework and its API, starting from the `memblock`.
Memblock Memblock
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
Memblock is one of the methods of managing memory regions during the early bootstrap period while the usual kernel memory allocators are not up and Memblock is one of the methods of managing memory regions during the early bootstrap period while the usual kernel memory allocators are not up and
running yet. Previously it was called `Logical Memory Block`, but with the [patch](https://lkml.org/lkml/2010/7/13/68) by Yinghai Lu, it was renamed to the `memblock`. As Linux kernel for `x86_64` architecture uses this method. We already met `memblock` in the [Last preparations before the kernel entry point](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) part. And now it's time to get acquainted with it closer. We will see how it is implemented. running yet. Previously it was called `Logical Memory Block`, but with the [patch](https://lkml.org/lkml/2010/7/13/68) by Yinghai Lu, it was renamed to the `memblock`. As Linux kernel for `x86_64` architecture uses this method. We already met `memblock` in the [Last preparations before the kernel entry point](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-3) part. And now it's time to get acquainted with it closer. We will see how it is implemented.
We will start to learn `memblock` from the data structures. Definitions of all logical-memory-block-related data structures can be found in the [include/linux/memblock.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/memblock.h) header file. We will start to learn `memblock` from the data structures. Definitions of all logical-memory-block-related data structures can be found in the [include/linux/memblock.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/memblock.h) header file.
@ -163,7 +163,7 @@ This function takes a physical base address and the size of the memory region as
memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0); memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0);
``` ```
function. We pass the memory block type - `memory`, the physical base address and the size of the memory region, the maximum number of nodes which is 1 if `CONFIG_NODES_SHIFT` is not set in the configuration file or `1 << CONFIG_NODES_SHIFT` if it is set, and the flags. The `memblock_add_range` function adds a new memory region to the memory block. It starts by checking the size of the given region and if it is zero it just returns. After this, `memblock_add_range` checks the existence of the memory regions in the `memblock` structure with the given `memblock_type`. If there are no memory regions, we just fill a new `memory_region` with the given values and return (we already saw the implementation of this in the [First touch of the linux kernel memory manager framework](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html)). If `memblock_type` is not empty, we start to add a new memory region to the `memblock` with the given `memblock_type`. function. We pass the memory block type - `memory`, the physical base address and the size of the memory region, the maximum number of nodes which is 1 if `CONFIG_NODES_SHIFT` is not set in the configuration file or `1 << CONFIG_NODES_SHIFT` if it is set, and the flags. The `memblock_add_range` function adds a new memory region to the memory block. It starts by checking the size of the given region and if it is zero it just returns. After this, `memblock_add_range` checks the existence of the memory regions in the `memblock` structure with the given `memblock_type`. If there are no memory regions, we just fill a new `memory_region` with the given values and return (we already saw the implementation of this in the [First touch of the linux kernel memory manager framework](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-3)). If `memblock_type` is not empty, we start to add a new memory region to the `memblock` with the given `memblock_type`.
First of all we get the end of the memory region with the: First of all we get the end of the memory region with the:
@ -420,4 +420,4 @@ Links
* [e820](http://en.wikipedia.org/wiki/E820) * [e820](http://en.wikipedia.org/wiki/E820)
* [numa](http://en.wikipedia.org/wiki/Non-uniform_memory_access) * [numa](http://en.wikipedia.org/wiki/Non-uniform_memory_access)
* [debugfs](http://en.wikipedia.org/wiki/Debugfs) * [debugfs](http://en.wikipedia.org/wiki/Debugfs)
* [First touch of the linux kernel memory manager framework](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) * [First touch of the linux kernel memory manager framework](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-3)

View File

@ -4,7 +4,7 @@ Linux kernel memory management Part 2.
Fix-Mapped Addresses and ioremap Fix-Mapped Addresses and ioremap
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
`Fix-Mapped` addresses are a set of special compile-time addresses whose corresponding physical addresses do not have to be a linear address minus `__START_KERNEL_map`. Each fix-mapped address maps one page frame and the kernel uses them as pointers that never change their address. That is the main point of these addresses. As the comment says: `to have a constant address at compile time, but to set the physical address only in the boot process`. You can remember that in the earliest [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html), we already set the `level2_fixmap_pgt`: `Fix-Mapped` addresses are a set of special compile-time addresses whose corresponding physical addresses do not have to be a linear address minus `__START_KERNEL_map`. Each fix-mapped address maps one page frame and the kernel uses them as pointers that never change their address. That is the main point of these addresses. As the comment says: `to have a constant address at compile time, but to set the physical address only in the boot process`. You can remember that in the earliest [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-1), we already set the `level2_fixmap_pgt`:
```assembly ```assembly
NEXT_PAGE(level2_fixmap_pgt) NEXT_PAGE(level2_fixmap_pgt)
@ -96,7 +96,7 @@ As in previous example (in `__fix_to_virt` macro), we start from the top of the
That's all. For this moment we know a little about `fix-mapped` addresses, but this is enough to go next. That's all. For this moment we know a little about `fix-mapped` addresses, but this is enough to go next.
`Fix-mapped` addresses are used in different [places](http://lxr.free-electrons.com/ident?i=fix_to_virt) in the linux kernel. `IDT` descriptor stored there, [Intel Trusted Execution Technology](http://en.wikipedia.org/wiki/Trusted_Execution_Technology) UUID stored in the `fix-mapped` area started from `FIX_TBOOT_BASE` index, [Xen](http://en.wikipedia.org/wiki/Xen) bootmap and many more... We already saw a little about `fix-mapped` addresses in the fifth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) about of the linux kernel initialization. We use `fix-mapped` area in the early `ioremap` initialization. Let's look at it more closely and try to understand what `ioremap` is, how it is implemented in the kernel and how it is related to the `fix-mapped` addresses. `Fix-mapped` addresses are used in different [places](http://lxr.free-electrons.com/ident?i=fix_to_virt) in the linux kernel. `IDT` descriptor stored there, [Intel Trusted Execution Technology](http://en.wikipedia.org/wiki/Trusted_Execution_Technology) UUID stored in the `fix-mapped` area started from `FIX_TBOOT_BASE` index, [Xen](http://en.wikipedia.org/wiki/Xen) bootmap and many more... We already saw a little about `fix-mapped` addresses in the fifth [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-5) about of the linux kernel initialization. We use `fix-mapped` area in the early `ioremap` initialization. Let's look at it more closely and try to understand what `ioremap` is, how it is implemented in the kernel and how it is related to the `fix-mapped` addresses.
ioremap ioremap
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
@ -149,7 +149,7 @@ As we can see it takes three parameters:
* `n` - length of region; * `n` - length of region;
* `name` - name of requester. * `name` - name of requester.
`request_region` allocates an `I/O` port region. Very often the `check_region` function is called before the `request_region` to check that the given address range is available and the `release_region` function to release the memory region. `request_region` returns a pointer to the `resource` structure. The `resource` structure represents an abstraction for a tree-like subset of system resources. We already saw the `resource` structure in the fifth part of the kernel [initialization](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) process and it looks as follows: `request_region` allocates an `I/O` port region. Very often the `check_region` function is called before the `request_region` to check that the given address range is available and the `release_region` function to release the memory region. `request_region` returns a pointer to the `resource` structure. The `resource` structure represents an abstraction for a tree-like subset of system resources. We already saw the `resource` structure in the fifth part of the kernel [initialization](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-5) process and it looks as follows:
```C ```C
struct resource { struct resource {
@ -274,13 +274,13 @@ static inline const char *e820_type_to_string(int e820_type)
and we can see them in the `/proc/iomem` (read above). and we can see them in the `/proc/iomem` (read above).
Now let's try to understand how `ioremap` works. We already know a little about `ioremap`, we saw it in the fifth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) about linux kernel initialization. If you have read this part, you can remember the call of the `early_ioremap_init` function from the [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/mm/ioremap.c). Initialization of the `ioremap` is split into two parts: there is the early part which we can use before the normal `ioremap` is available and the normal `ioremap` which is available after `vmalloc` initialization and the call of `paging_init`. We do not know anything about `vmalloc` for now, so let's consider early initialization of the `ioremap`. First of all `early_ioremap_init` checks that `fixmap` is aligned on page middle directory boundary: Now let's try to understand how `ioremap` works. We already know a little about `ioremap`, we saw it in the fifth [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-5) about linux kernel initialization. If you have read this part, you can remember the call of the `early_ioremap_init` function from the [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/mm/ioremap.c). Initialization of the `ioremap` is split into two parts: there is the early part which we can use before the normal `ioremap` is available and the normal `ioremap` which is available after `vmalloc` initialization and the call of `paging_init`. We do not know anything about `vmalloc` for now, so let's consider early initialization of the `ioremap`. First of all `early_ioremap_init` checks that `fixmap` is aligned on page middle directory boundary:
```C ```C
BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1)); BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
``` ```
more about `BUILD_BUG_ON` you can read in the first part about [Linux Kernel initialization](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html). So `BUILD_BUG_ON` macro raises a compilation error if the given expression is true. In the next step after this check, we can see call of the `early_ioremap_setup` function from the [mm/early_ioremap.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/mm/early_ioremap.c). This function presents generic initialization of the `ioremap`. `early_ioremap_setup` function fills the `slot_virt` array with the virtual addresses of the early fixmaps. All early fixmaps are after `__end_of_permanent_fixed_addresses` in memory. They start at `FIX_BITMAP_BEGIN` (top) and end with `FIX_BITMAP_END` (down). Actually there are `512` temporary boot-time mappings, used by early `ioremap`: more about `BUILD_BUG_ON` you can read in the first part about [Linux Kernel initialization](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-1). So `BUILD_BUG_ON` macro raises a compilation error if the given expression is true. In the next step after this check, we can see call of the `early_ioremap_setup` function from the [mm/early_ioremap.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/mm/early_ioremap.c). This function presents generic initialization of the `ioremap`. `early_ioremap_setup` function fills the `slot_virt` array with the virtual addresses of the early fixmaps. All early fixmaps are after `__end_of_permanent_fixed_addresses` in memory. They start at `FIX_BITMAP_BEGIN` (top) and end with `FIX_BITMAP_END` (down). Actually there are `512` temporary boot-time mappings, used by early `ioremap`:
``` ```
#define NR_FIX_BTMAPS 64 #define NR_FIX_BTMAPS 64
@ -335,7 +335,7 @@ pmd_populate_kernel(&init_mm, pmd, bm_pte);
`pmd_populate_kernel` takes three parameters: `pmd_populate_kernel` takes three parameters:
* `init_mm` - memory descriptor of the `init` process (you can read about it in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html)); * `init_mm` - memory descriptor of the `init` process (you can read about it in the previous [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-5));
* `pmd` - page middle directory of the beginning of the `ioremap` fixmaps; * `pmd` - page middle directory of the beginning of the `ioremap` fixmaps;
* `bm_pte` - early `ioremap` page table entries array which defined as: * `bm_pte` - early `ioremap` page table entries array which defined as:
@ -535,5 +535,5 @@ Links
* [e820](http://en.wikipedia.org/wiki/E820) * [e820](http://en.wikipedia.org/wiki/E820)
* [Memory management unit](http://en.wikipedia.org/wiki/Memory_management_unit) * [Memory management unit](http://en.wikipedia.org/wiki/Memory_management_unit)
* [TLB](http://en.wikipedia.org/wiki/Translation_lookaside_buffer) * [TLB](http://en.wikipedia.org/wiki/Translation_lookaside_buffer)
* [Paging](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html) * [Paging](https://0xax.gitbook.io/linux-insides/summary/theory/linux-theory-1)
* [Linux kernel memory management Part 1.](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-1.html) * [Linux kernel memory management Part 1.](https://0xax.gitbook.io/linux-insides/summary/mm/linux-mm-1)

View File

@ -4,7 +4,7 @@ Linux kernel memory management Part 3.
Introduction to the kmemcheck in the Linux kernel Introduction to the kmemcheck in the Linux kernel
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/MM/) which describes [memory management](https://en.wikipedia.org/wiki/Memory_management) in the Linux kernel and in the previous [part](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-2.html) of this chapter we met two memory management related concepts: This is the third part of the [chapter](https://0xax.gitbook.io/linux-insides/summary/mm) which describes [memory management](https://en.wikipedia.org/wiki/Memory_management) in the Linux kernel and in the previous [part](https://0xax.gitbook.io/linux-insides/summary/mm/linux-mm-2) of this chapter we met two memory management related concepts:
* `Fix-Mapped Addresses`; * `Fix-Mapped Addresses`;
* `ioremap`. * `ioremap`.
@ -61,11 +61,11 @@ $ sudo cat /proc/ioports
... ...
``` ```
can show us lists of currently registered port regions used for input or output communication with a device. All memory-mapped I/O addresses are not used by the kernel directly. So, before the Linux kernel can use such memory, it must map it to the virtual memory space which is the main purpose of the `ioremap` mechanism. Note that we saw only early `ioremap` in the previous [part](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-2.html). Soon we will look at the implementation of the non-early `ioremap` function. But before this we must learn other things, like a different types of memory allocators and etc., because in other way it will be very difficult to understand it. can show us lists of currently registered port regions used for input or output communication with a device. All memory-mapped I/O addresses are not used by the kernel directly. So, before the Linux kernel can use such memory, it must map it to the virtual memory space which is the main purpose of the `ioremap` mechanism. Note that we saw only early `ioremap` in the previous [part](https://0xax.gitbook.io/linux-insides/summary/mm/linux-mm-2). Soon we will look at the implementation of the non-early `ioremap` function. But before this we must learn other things, like a different types of memory allocators and etc., because in other way it will be very difficult to understand it.
So, before we will move on to the non-early [memory management](https://en.wikipedia.org/wiki/Memory_management) of the Linux kernel, we will see some mechanisms which provide special abilities for [debugging](https://en.wikipedia.org/wiki/Debugging), check of [memory leaks](https://en.wikipedia.org/wiki/Memory_leak), memory control and etc. It will be easier to understand how memory management arranged in the Linux kernel after learning of all of these things. So, before we will move on to the non-early [memory management](https://en.wikipedia.org/wiki/Memory_management) of the Linux kernel, we will see some mechanisms which provide special abilities for [debugging](https://en.wikipedia.org/wiki/Debugging), check of [memory leaks](https://en.wikipedia.org/wiki/Memory_leak), memory control and etc. It will be easier to understand how memory management arranged in the Linux kernel after learning of all of these things.
As you already may guess from the title of this part, we will start to consider memory mechanisms from the [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt). As we always did in other [chapters](https://0xax.gitbooks.io/linux-insides/content/), we will start to consider from theoretical side and will learn what is `kmemcheck` mechanism in general and only after this, we will see how it is implemented in the Linux kernel. As you already may guess from the title of this part, we will start to consider memory mechanisms from the [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt). As we always did in other [chapters](https://github.com/0xAX/linux-insides/blob/master/SUMMARY.md), we will start to consider from theoretical side and will learn what is `kmemcheck` mechanism in general and only after this, we will see how it is implemented in the Linux kernel.
So let's start. What is it `kmemcheck` in the Linux kernel? As you may guess from the name of this mechanism, the `kmemcheck` checks memory. That's true. Main point of the `kmemcheck` mechanism is to check that some kernel code accesses `uninitialized memory`. Let's take following simple [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) program: So let's start. What is it `kmemcheck` in the Linux kernel? As you may guess from the name of this mechanism, the `kmemcheck` checks memory. That's true. Main point of the `kmemcheck` mechanism is to check that some kernel code accesses `uninitialized memory`. Let's take following simple [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) program:
@ -148,7 +148,7 @@ Ok, so we know that `kmemcheck` provides mechanism to check usage of `uninitiali
struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL); struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL);
``` ```
or in other words somebody wants to access a [page](https://en.wikipedia.org/wiki/Page_%28computer_memory%29), a [page fault](https://en.wikipedia.org/wiki/Page_fault) exception is generated. This is achieved by the fact that the `kmemcheck` marks memory pages as `non-present` (more about this you can read in the special part which is devoted to [Paging](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html)). If a `page fault` exception is occurred, the exception handler knows about it and in a case when the `kmemcheck` is enabled it transfers control to it. After the `kmemcheck` will finish its checks, the page will be marked as `present` and the interrupted code will be able to continue execution. There is little subtlety in this chain. When the first instruction of interrupted code will be executed, the `kmemcheck` will mark the page as `non-present` again. In this way next access to memory will be caught again. or in other words somebody wants to access a [page](https://en.wikipedia.org/wiki/Page_%28computer_memory%29), a [page fault](https://en.wikipedia.org/wiki/Page_fault) exception is generated. This is achieved by the fact that the `kmemcheck` marks memory pages as `non-present` (more about this you can read in the special part which is devoted to [Paging](https://0xax.gitbook.io/linux-insides/summary/theory/linux-theory-1)). If a `page fault` exception is occurred, the exception handler knows about it and in a case when the `kmemcheck` is enabled it transfers control to it. After the `kmemcheck` will finish its checks, the page will be marked as `present` and the interrupted code will be able to continue execution. There is little subtlety in this chain. When the first instruction of interrupted code will be executed, the `kmemcheck` will mark the page as `non-present` again. In this way next access to memory will be caught again.
We just considered the `kmemcheck` mechanism from theoretical side. Now let's consider how it is implemented in the Linux kernel. We just considered the `kmemcheck` mechanism from theoretical side. Now let's consider how it is implemented in the Linux kernel.
@ -167,7 +167,7 @@ to the Linux kernel command line. The first two are clear, but the last needs a
![kernel configuration menu](images/kernel_configuration_menu2.png) ![kernel configuration menu](images/kernel_configuration_menu2.png)
We know from the seventh [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html) of the chapter which describes initialization of the Linux kernel that the kernel command line is parsed during initialization of the Linux kernel in `do_initcall_level`, `do_early_param` functions. Actually the `kmemcheck` subsystem consists from two stages. The first stage is early. If we will look at the [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/mm/kmemcheck.c) source code file, we will see the `param_kmemcheck` function which is will be called during early command line parsing: We know from the seventh [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-7) of the chapter which describes initialization of the Linux kernel that the kernel command line is parsed during initialization of the Linux kernel in `do_initcall_level`, `do_early_param` functions. Actually the `kmemcheck` subsystem consists from two stages. The first stage is early. If we will look at the [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/mm/kmemcheck.c) source code file, we will see the `param_kmemcheck` function which is will be called during early command line parsing:
```C ```C
static int __init param_kmemcheck(char *str) static int __init param_kmemcheck(char *str)
@ -190,7 +190,7 @@ early_param("kmemcheck", param_kmemcheck);
As we already saw, the `param_kmemcheck` may have one of the following values: `0` (enabled), `1` (disabled) or `2` (one-shot). The implementation of the `param_kmemcheck` is pretty simple. We just convert string value of the `kmemcheck` command line option to integer representation and set it to the `kmemcheck_enabled` variable. As we already saw, the `param_kmemcheck` may have one of the following values: `0` (enabled), `1` (disabled) or `2` (one-shot). The implementation of the `param_kmemcheck` is pretty simple. We just convert string value of the `kmemcheck` command line option to integer representation and set it to the `kmemcheck_enabled` variable.
The second stage will be executed during initialization of the Linux kernel, rather during initialization of early [initcalls](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-3.html). The second stage is represented by the `kmemcheck_init`: The second stage will be executed during initialization of the Linux kernel, rather during initialization of early [initcalls](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-3). The second stage is represented by the `kmemcheck_init`:
```C ```C
int __init kmemcheck_init(void) int __init kmemcheck_init(void)
@ -278,7 +278,7 @@ void kmemcheck_hide_pages(struct page *p, unsigned int n)
Here we go through all pages and and tries to get `page table entry` for each page. If this operation was successful, we unset present bit and set hidden bit in each page. In the end we flush [translation lookaside buffer](https://en.wikipedia.org/wiki/Translation_lookaside_buffer), because some pages was changed. From this point allocated pages are tracked by the `kmemcheck`. Now, as `present` bit is unset, the [page fault](https://en.wikipedia.org/wiki/Page_fault) execution will be occurred right after the `kmalloc` will return pointer to allocated space and a code will try to access this memory. Here we go through all pages and and tries to get `page table entry` for each page. If this operation was successful, we unset present bit and set hidden bit in each page. In the end we flush [translation lookaside buffer](https://en.wikipedia.org/wiki/Translation_lookaside_buffer), because some pages was changed. From this point allocated pages are tracked by the `kmemcheck`. Now, as `present` bit is unset, the [page fault](https://en.wikipedia.org/wiki/Page_fault) execution will be occurred right after the `kmalloc` will return pointer to allocated space and a code will try to access this memory.
As you may remember from the [second part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) of the Linux kernel initialization chapter, the `page fault` handler is located in the [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/mm/fault.c) source code file and represented by the `do_page_fault` function. We can see following check from the beginning of the `do_page_fault` function: As you may remember from the [second part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-2) of the Linux kernel initialization chapter, the `page fault` handler is located in the [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/mm/fault.c) source code file and represented by the `do_page_fault` function. We can see following check from the beginning of the `do_page_fault` function:
```C ```C
static noinline void static noinline void
@ -296,7 +296,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
} }
``` ```
The `kmemcheck_active` gets `kmemcheck_context` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) structure and return the result of comparison of the `balance` field of this structure with zero: The `kmemcheck_active` gets `kmemcheck_context` [per-cpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1) structure and return the result of comparison of the `balance` field of this structure with zero:
``` ```
bool kmemcheck_active(struct pt_regs *regs) bool kmemcheck_active(struct pt_regs *regs)
@ -337,7 +337,7 @@ Last two steps of the `kmemcheck_fault` function is to call the `kmemcheck_acces
static struct kmemcheck_error error_fifo[CONFIG_KMEMCHECK_QUEUE_SIZE]; static struct kmemcheck_error error_fifo[CONFIG_KMEMCHECK_QUEUE_SIZE];
``` ```
The `kmemcheck` mechanism declares special [tasklet](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-9.html): The `kmemcheck` mechanism declares special [tasklet](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-9):
```C ```C
static DECLARE_TASKLET(kmemcheck_tasklet, &do_wakeup, 0); static DECLARE_TASKLET(kmemcheck_tasklet, &do_wakeup, 0);
@ -422,12 +422,12 @@ Links
* [memory leaks](https://en.wikipedia.org/wiki/Memory_leak) * [memory leaks](https://en.wikipedia.org/wiki/Memory_leak)
* [kmemcheck documentation](https://www.kernel.org/doc/Documentation/kmemcheck.txt) * [kmemcheck documentation](https://www.kernel.org/doc/Documentation/kmemcheck.txt)
* [valgrind](https://en.wikipedia.org/wiki/Valgrind) * [valgrind](https://en.wikipedia.org/wiki/Valgrind)
* [Paging](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html) * [Paging](https://0xax.gitbook.io/linux-insides/summary/theory/linux-theory-1)
* [page fault](https://en.wikipedia.org/wiki/Page_fault) * [page fault](https://en.wikipedia.org/wiki/Page_fault)
* [initcalls](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-3.html) * [initcalls](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-3)
* [opcode](https://en.wikipedia.org/wiki/Opcode) * [opcode](https://en.wikipedia.org/wiki/Opcode)
* [translation lookaside buffer](https://en.wikipedia.org/wiki/Translation_lookaside_buffer) * [translation lookaside buffer](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)
* [per-cpu variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) * [per-cpu variables](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1)
* [flags register](https://en.wikipedia.org/wiki/FLAGS_register) * [flags register](https://en.wikipedia.org/wiki/FLAGS_register)
* [tasklet](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-9.html) * [tasklet](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-9)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-2.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/mm/linux-mm-2)

View File

@ -8,7 +8,7 @@ As you already may know, I've started a series of [blog posts](https://0xax.gith
After writing some assembler code I began to understand how my program looks after compilation, **approximately**. But anyway, I didn't understand many other things. For example: what occurs when the `syscall` instruction is executed in my assembler, what occurs when the `printf` function starts to work or how can my program talk with other computers via network. [Assembler](https://en.wikipedia.org/wiki/Assembly_language#Assembler) programming language didn't give me answers to my questions and I decided to go deeper in my research. I started to learn from the source code of the Linux kernel and tried to understand the things that I'm interested in. The source code of the Linux kernel didn't give me the answers to **all** of my questions, but now my knowledge about the Linux kernel and the processes around it is much better. After writing some assembler code I began to understand how my program looks after compilation, **approximately**. But anyway, I didn't understand many other things. For example: what occurs when the `syscall` instruction is executed in my assembler, what occurs when the `printf` function starts to work or how can my program talk with other computers via network. [Assembler](https://en.wikipedia.org/wiki/Assembly_language#Assembler) programming language didn't give me answers to my questions and I decided to go deeper in my research. I started to learn from the source code of the Linux kernel and tried to understand the things that I'm interested in. The source code of the Linux kernel didn't give me the answers to **all** of my questions, but now my knowledge about the Linux kernel and the processes around it is much better.
I'm writing this part nine and a half months after I've started to learn from the source code of the Linux kernel and published the first [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-1.html) of this book. Now it contains forty parts and it is not the end. I decided to write this series about the Linux kernel mostly for myself. As you know the Linux kernel is very huge piece of code and it is easy to forget what does this or that part of the Linux kernel mean and how does it implement something. But soon the [linux-insides](https://github.com/0xAX/linux-insides) repo became popular and after nine months it has `9096` stars: I'm writing this part nine and a half months after I've started to learn from the source code of the Linux kernel and published the first [part](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-1) of this book. Now it contains forty parts and it is not the end. I decided to write this series about the Linux kernel mostly for myself. As you know the Linux kernel is very huge piece of code and it is easy to forget what does this or that part of the Linux kernel mean and how does it implement something. But soon the [linux-insides](https://github.com/0xAX/linux-insides) repo became popular and after nine months it has `9096` stars:
![github](images/github.png) ![github](images/github.png)
@ -291,7 +291,7 @@ $ git checkout master
$ git pull upstream master $ git pull upstream master
``` ```
After this my local repository with the Linux kernel source code is synced with the [mainline](https://github.com/torvalds/linux) repository. Now we can make some changes in the source code. As I already wrote, I have no advice for you where you can start and what `TODO` in the Linux kernel. But the best place for newbies is `staging` tree. In other words the set of drivers from the [drivers/staging](https://github.com/torvalds/linux/tree/master/drivers/staging). The maintainer of the `staging` tree is [Greg Kroah-Hartman](https://en.wikipedia.org/wiki/Greg_Kroah-Hartman) and the `staging` tree is that place where your trivial patch can be accepted. Let's look on a simple example that describes how to generate patch, check it and send to the [Linux kernel mail listing](https://lkml.org/). As soon as your local copy of the linux kernel source code is in sync with the [mainline](https://github.com/torvalds/linux) repository, we can start to apply changes to it. I already wrote, I have no advice for where you should start and what `TODO` to choose within the Linux kernel. But the best place for newbies is the `staging` tree. In other words the set of drivers from the [drivers/staging](https://github.com/torvalds/linux/tree/master/drivers/staging) directory. The maintainer of this tree is [Greg Kroah-Hartman](https://en.wikipedia.org/wiki/Greg_Kroah-Hartman) and the `staging` drivers are a good target for trivial patch fixes. Let's look at this simple example, that describes how to generate a patch, check it and send it to the [Linux kernel mail listing](https://lkml.org/).
If we look in the driver for the [Digi International EPCA PCI](https://github.com/torvalds/linux/tree/master/drivers/staging/dgap) based devices, we will see the `dgap_sindex` function on line 295: If we look in the driver for the [Digi International EPCA PCI](https://github.com/torvalds/linux/tree/master/drivers/staging/dgap) based devices, we will see the `dgap_sindex` function on line 295:

View File

@ -1,7 +1,7 @@
Introduction Introduction
--------------- ---------------
During the writing of the [linux-insides](https://0xax.gitbooks.io/linux-insides/content/) book I have received many emails with questions related to the [linker](https://en.wikipedia.org/wiki/Linker_%28computing%29) script and linker-related subjects. So I've decided to write this to cover some aspects of the linker and the linking of object files. During the writing of the [linux-insides](https://github.com/0xAX/linux-insides/blob/master/SUMMARY.md) book I have received many emails with questions related to the [linker](https://en.wikipedia.org/wiki/Linker_%28computing%29) script and linker-related subjects. So I've decided to write this to cover some aspects of the linker and the linking of object files.
If we open the `Linker` page on Wikipedia, we will see following definition: If we open the `Linker` page on Wikipedia, we will see following definition:
@ -569,7 +569,7 @@ Disassembly of section .data:
... ...
``` ```
Apart from the commands we have already seen, there are a few others. The first is the `ASSERT(exp, message)` that ensures that given expression is not zero. If it is zero, then exit the linker with an error code and print the given error message. If you've read about Linux kernel booting process in the [linux-insides](https://0xax.gitbooks.io/linux-insides/content/) book, you may know that the setup header of the Linux kernel has offset `0x1f1`. In the linker script of the Linux kernel we can find a check for this: Apart from the commands we have already seen, there are a few others. The first is the `ASSERT(exp, message)` that ensures that given expression is not zero. If it is zero, then exit the linker with an error code and print the given error message. If you've read about Linux kernel booting process in the [linux-insides](https://github.com/0xAX/linux-insides/blob/master/SUMMARY.md) book, you may know that the setup header of the Linux kernel has offset `0x1f1`. In the linker script of the Linux kernel we can find a check for this:
``` ```
. = ASSERT(hdr == 0x1f1, "The setup header has the wrong offset!"); . = ASSERT(hdr == 0x1f1, "The setup header has the wrong offset!");
@ -631,7 +631,7 @@ Please note that English is not my first language, and I am really sorry for any
Links Links
----------------- -----------------
* [Book about Linux kernel insides](https://0xax.gitbooks.io/linux-insides/content/) * [Book about Linux kernel insides](https://github.com/0xAX/linux-insides/blob/master/SUMMARY.md)
* [linker](https://en.wikipedia.org/wiki/Linker_%28computing%29) * [linker](https://en.wikipedia.org/wiki/Linker_%28computing%29)
* [object files](https://en.wikipedia.org/wiki/Object_file) * [object files](https://en.wikipedia.org/wiki/Object_file)
* [glibc](https://en.wikipedia.org/wiki/GNU_C_Library) * [glibc](https://en.wikipedia.org/wiki/GNU_C_Library)

View File

@ -6,7 +6,7 @@ Introduction
Despite the [linux-insides](https://www.gitbook.com/book/0xax/linux-insides/details) described mostly Linux kernel related stuff, I have decided to write this one part which mostly related to userspace. Despite the [linux-insides](https://www.gitbook.com/book/0xax/linux-insides/details) described mostly Linux kernel related stuff, I have decided to write this one part which mostly related to userspace.
There is already fourth [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-4.html) of [System calls](https://en.wikipedia.org/wiki/System_call) chapter which describes what does the Linux kernel do when we want to start a program. In this part I want to explore what happens when we run a program on a Linux machine from userspace perspective. There is already fourth [part](https://0xax.gitbook.io/linux-insides/summary/syscall/linux-syscall-4) of [System calls](https://en.wikipedia.org/wiki/System_call) chapter which describes what does the Linux kernel do when we want to start a program. In this part I want to explore what happens when we run a program on a Linux machine from userspace perspective.
I don't know how about you, but in my university I learn that a `C` program starts executing from the function which is called `main`. And that's partly true. Whenever we are starting to write new program, we start our program from the following lines of code: I don't know how about you, but in my university I learn that a `C` program starts executing from the function which is called `main`. And that's partly true. Whenever we are starting to write new program, we start our program from the following lines of code:
@ -123,7 +123,7 @@ Ok, everything looks pretty good up to now. You may already know that there is a
> The exec() family of functions replaces the current process image with a new process image. > The exec() family of functions replaces the current process image with a new process image.
All the `exec*` functions are simple frontends to the [execve](http://man7.org/linux/man-pages/man2/execve.2.html) system call. If you have read the fourth [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-4.html) of the chapter which describes [system calls](https://en.wikipedia.org/wiki/System_call), you may know that the [execve](http://linux.die.net/man/2/execve) system call is defined in the [files/exec.c](https://github.com/torvalds/linux/blob/08e4e0d0456d0ca8427b2d1ddffa30f1c3e774d7/fs/exec.c#L1888) source code file and looks like: All the `exec*` functions are simple frontends to the [execve](http://man7.org/linux/man-pages/man2/execve.2.html) system call. If you have read the fourth [part](https://0xax.gitbook.io/linux-insides/summary/syscall/linux-syscall-4) of the chapter which describes [system calls](https://en.wikipedia.org/wiki/System_call), you may know that the [execve](http://linux.die.net/man/2/execve) system call is defined in the [files/exec.c](https://github.com/torvalds/linux/blob/08e4e0d0456d0ca8427b2d1ddffa30f1c3e774d7/fs/exec.c#L1888) source code file and looks like:
```C ```C
SYSCALL_DEFINE3(execve, SYSCALL_DEFINE3(execve,
@ -135,7 +135,7 @@ SYSCALL_DEFINE3(execve,
} }
``` ```
It takes an executable file name, set of command line arguments, and set of enviroment variables. As you may guess, everything is done by the `do_execve` function. I will not describe the implementation of the `do_execve` function in detail because you can read about this in [here](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-4.html). But in short words, the `do_execve` function does many checks like `filename` is valid, limit of launched processes is not exceed in our system and etc. After all of these checks, this function parses our executable file which is represented in [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) format, creates memory descriptor for newly executed executable file and fills it with the appropriate values like area for the stack, heap and etc. When the setup of new binary image is done, the `start_thread` function will set up one new process. This function is architecture-specific and for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, its definition will be located in the [arch/x86/kernel/process_64.c](https://github.com/torvalds/linux/blob/08e4e0d0456d0ca8427b2d1ddffa30f1c3e774d7/arch/x86/kernel/process_64.c#L239) source code file. It takes an executable file name, set of command line arguments, and set of enviroment variables. As you may guess, everything is done by the `do_execve` function. I will not describe the implementation of the `do_execve` function in detail because you can read about this in [here](https://0xax.gitbook.io/linux-insides/summary/syscall/linux-syscall-4). But in short words, the `do_execve` function does many checks like `filename` is valid, limit of launched processes is not exceed in our system and etc. After all of these checks, this function parses our executable file which is represented in [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) format, creates memory descriptor for newly executed executable file and fills it with the appropriate values like area for the stack, heap and etc. When the setup of new binary image is done, the `start_thread` function will set up one new process. This function is architecture-specific and for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, its definition will be located in the [arch/x86/kernel/process_64.c](https://github.com/torvalds/linux/blob/08e4e0d0456d0ca8427b2d1ddffa30f1c3e774d7/arch/x86/kernel/process_64.c#L239) source code file.
The `start_thread` function sets new value to [segment registers](https://en.wikipedia.org/wiki/X86_memory_segmentation) and program execution address. From this point, our new process is ready to start. Once the [context switch](https://en.wikipedia.org/wiki/Context_switch) will be done, control will be returned to userspace with new values of registers and the new executable will be started to execute. The `start_thread` function sets new value to [segment registers](https://en.wikipedia.org/wiki/X86_memory_segmentation) and program execution address. From this point, our new process is ready to start. Once the [context switch](https://en.wikipedia.org/wiki/Context_switch) will be done, control will be returned to userspace with new values of registers and the new executable will be started to execute.

Binary file not shown.

View File

@ -4,7 +4,7 @@ Synchronization primitives in the Linux kernel. Part 1.
Introduction Introduction
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This part opens a new chapter in the [linux-insides](https://0xax.gitbooks.io/linux-insides/content/) book. Timers and time management related stuff was described in the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html). Now it's time to move on to the next topic. As you probably recognized from the title, this chapter will describe the [synchronization](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) primitives in the Linux kernel. This part opens a new chapter in the [linux-insides](https://github.com/0xAX/linux-insides/blob/master/SUMMARY.md) book. Timers and time management related stuff was described in the previous [chapter](https://0xax.gitbook.io/linux-insides/summary/timers/). Now it's time to move on to the next topic. As you probably recognized from the title, this chapter will describe the [synchronization](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) primitives in the Linux kernel.
As always, we will try to know what a `synchronization primitive` in general is before we deal with any synchronization-related issues. Actually, a synchronization primitive is a software mechanism, that ensures that two or more [parallel](https://en.wikipedia.org/wiki/Parallel_computing) processes or threads are not running simultaneously on the same code segment. For example, let's look at the following piece of code: As always, we will try to know what a `synchronization primitive` in general is before we deal with any synchronization-related issues. Actually, a synchronization primitive is a software mechanism, that ensures that two or more [parallel](https://en.wikipedia.org/wiki/Parallel_computing) processes or threads are not running simultaneously on the same code segment. For example, let's look at the following piece of code:
@ -22,7 +22,7 @@ clocksource_select();
mutex_unlock(&clocksource_mutex); mutex_unlock(&clocksource_mutex);
``` ```
from the [kernel/time/clocksource.c](https://github.com/torvalds/linux/blob/master/kernel/time/clocksource.c) source code file. This code is from the `__clocksource_register_scale` function which adds the given [clocksource](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-2.html) to the clock sources list. This function produces different operations on a list with registered clock sources. For example, the `clocksource_enqueue` function adds the given clock source to the list with registered clocksources - `clocksource_list`. Note that these lines of code wrapped to two functions: `mutex_lock` and `mutex_unlock` which takes one parameter - the `clocksource_mutex` in our case. from the [kernel/time/clocksource.c](https://github.com/torvalds/linux/blob/master/kernel/time/clocksource.c) source code file. This code is from the `__clocksource_register_scale` function which adds the given [clocksource](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-2) to the clock sources list. This function produces different operations on a list with registered clock sources. For example, the `clocksource_enqueue` function adds the given clock source to the list with registered clocksources - `clocksource_list`. Note that these lines of code wrapped to two functions: `mutex_lock` and `mutex_unlock` which takes one parameter - the `clocksource_mutex` in our case.
These functions represent locking and unlocking based on [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) synchronization primitive. As `mutex_lock` will be executed, it allows us to prevent the situation when two or more threads will execute this code while the `mutex_unlock` will not be executed by process-owner of the mutex. In other words, we prevent parallel operations on a `clocksource_list`. Why do we need `mutex` here? What if two parallel processes will try to register a clock source. As we already know, the `clocksource_enqueue` function adds the given clock source to the `clocksource_list` list right after a clock source in the list which has the biggest rating (a registered clock source which has the highest frequency in the system): These functions represent locking and unlocking based on [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) synchronization primitive. As `mutex_lock` will be executed, it allows us to prevent the situation when two or more threads will execute this code while the `mutex_unlock` will not be executed by process-owner of the mutex. In other words, we prevent parallel operations on a `clocksource_list`. Why do we need `mutex` here? What if two parallel processes will try to register a clock source. As we already know, the `clocksource_enqueue` function adds the given clock source to the `clocksource_list` list right after a clock source in the list which has the biggest rating (a registered clock source which has the highest frequency in the system):
@ -43,7 +43,7 @@ static void clocksource_enqueue(struct clocksource *cs)
If two parallel processes will try to do it simultaneously, both process may found the same `entry` may occur [race condition](https://en.wikipedia.org/wiki/Race_condition) or in other words, the second process which will execute `list_add`, will overwrite a clock source from the first thread. If two parallel processes will try to do it simultaneously, both process may found the same `entry` may occur [race condition](https://en.wikipedia.org/wiki/Race_condition) or in other words, the second process which will execute `list_add`, will overwrite a clock source from the first thread.
Besides this simple example, synchronization primitives are ubiquitous in the Linux kernel. If we will go through the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) or other chapters again or if we will look at the Linux kernel source code in general, we will meet many places like this. We will not consider how `mutex` is implemented in the Linux kernel. Actually, the Linux kernel provides a set of different synchronization primitives like: Besides this simple example, synchronization primitives are ubiquitous in the Linux kernel. If we will go through the previous [chapter](https://0xax.gitbook.io/linux-insides/summary/timers/) or other chapters again or if we will look at the Linux kernel source code in general, we will meet many places like this. We will not consider how `mutex` is implemented in the Linux kernel. Actually, the Linux kernel provides a set of different synchronization primitives like:
* `mutex`; * `mutex`;
* `semaphores`; * `semaphores`;
@ -236,7 +236,7 @@ static inline void __raw_spin_lock(raw_spinlock_t *lock)
} }
``` ```
As you may see, first of all we disable [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) by the call of the `preempt_disable` macro from the [include/linux/preempt.h](https://github.com/torvalds/linux/blob/master/include/linux/preempt.h) (more about this you may read in the ninth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html) of the Linux kernel initialization process chapter). When we unlock the given `spinlock`, preemption will be enabled again: As you may see, first of all we disable [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) by the call of the `preempt_disable` macro from the [include/linux/preempt.h](https://github.com/torvalds/linux/blob/master/include/linux/preempt.h) (more about this you may read in the ninth [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-9) of the Linux kernel initialization process chapter). When we unlock the given `spinlock`, preemption will be enabled again:
```C ```C
static inline void __raw_spin_unlock(raw_spinlock_t *lock) static inline void __raw_spin_unlock(raw_spinlock_t *lock)
@ -326,7 +326,7 @@ Links
* [Concurrent computing](https://en.wikipedia.org/wiki/Concurrent_computing) * [Concurrent computing](https://en.wikipedia.org/wiki/Concurrent_computing)
* [Synchronization](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) * [Synchronization](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29)
* [Clocksource framework](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-2.html) * [Clocksource framework](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-2)
* [Mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) * [Mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
* [Race condition](https://en.wikipedia.org/wiki/Race_condition) * [Race condition](https://en.wikipedia.org/wiki/Race_condition)
* [Atomic operations](https://en.wikipedia.org/wiki/Linearizability) * [Atomic operations](https://en.wikipedia.org/wiki/Linearizability)
@ -339,4 +339,4 @@ Links
* [xadd instruction](http://x86.renejeschke.de/html/file_module_x86_id_327.html) * [xadd instruction](http://x86.renejeschke.de/html/file_module_x86_id_327.html)
* [NOP](https://en.wikipedia.org/wiki/NOP) * [NOP](https://en.wikipedia.org/wiki/NOP)
* [Memory barriers](https://www.kernel.org/doc/Documentation/memory-barriers.txt) * [Memory barriers](https://www.kernel.org/doc/Documentation/memory-barriers.txt)
* [Previous chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) * [Previous chapter](https://0xax.gitbook.io/linux-insides/summary/timers/)

View File

@ -4,9 +4,9 @@ Synchronization primitives in the Linux kernel. Part 2.
Queued Spinlocks Queued Spinlocks
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the second part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html) which describes synchronization primitives in the Linux kernel. In the first [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) of this chapter we meet the first [spinlock](https://en.wikipedia.org/wiki/Spinlock). We will continue to learn about this synchronization primitive here. If you have read the previous part, you may remember that besides normal spinlocks, the Linux kernel provides a special type of `spinlocks` - `queued spinlocks`. Here we will try to understand what this concept represents. This is the second part of the [chapter](https://0xax.gitbook.io/linux-insides/summary/syncprim) which describes synchronization primitives in the Linux kernel. In the first [part](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-1) of this chapter we meet the first [spinlock](https://en.wikipedia.org/wiki/Spinlock). We will continue to learn about this synchronization primitive here. If you have read the previous part, you may remember that besides normal spinlocks, the Linux kernel provides a special type of `spinlocks` - `queued spinlocks`. Here we will try to understand what this concept represents.
We saw the [API](https://en.wikipedia.org/wiki/Application_programming_interface) of `spinlock` in the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html): We saw the [API](https://en.wikipedia.org/wiki/Application_programming_interface) of `spinlock` in the previous [part](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-1):
* `spin_lock_init` - produces initialization of the given `spinlock`; * `spin_lock_init` - produces initialization of the given `spinlock`;
* `spin_lock` - acquires given `spinlock`; * `spin_lock` - acquires given `spinlock`;
@ -79,7 +79,7 @@ The first thread will execute the `test_and_set` which will set the `lock` to `1
The topic of this part is `queued spinlocks`. This approach may help to solve both of these problems. The `queued spinlocks` allows each processor to spin while checking its own memory location. The basic principle of a queue-based spinlock can best be understood by studying a classic queue-based spinlock implementation called the [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) lock. Before we look at implementation of the `queued spinlocks` in the Linux kernel, we will try to understand how `MCS` lock works. The topic of this part is `queued spinlocks`. This approach may help to solve both of these problems. The `queued spinlocks` allows each processor to spin while checking its own memory location. The basic principle of a queue-based spinlock can best be understood by studying a classic queue-based spinlock implementation called the [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) lock. Before we look at implementation of the `queued spinlocks` in the Linux kernel, we will try to understand how `MCS` lock works.
The basic idea of the `MCS` lock is that a thread spins on a local variable and each processor in the system has its own copy of this variable (see the previous paragraph). In other words this concept is built on top of the [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variables concept in the Linux kernel. The basic idea of the `MCS` lock is that a thread spins on a local variable and each processor in the system has its own copy of this variable (see the previous paragraph). In other words this concept is built on top of the [per-cpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1) variables concept in the Linux kernel.
When the first thread wants to acquire a lock, it registers itself in the `queue`. In other words it will be added to the special `queue` and will acquire lock, because it is free for now. When the second thread wants to acquire the same lock before the first thread release it, this thread adds its own copy of the lock variable into this `queue`. In this case the first thread will contain a `next` field which will point to the second thread. From this moment, the second thread will wait until the first thread release its lock and notify `next` thread about this event. The first thread will be deleted from the `queue` and the second thread will be owner of a lock. When the first thread wants to acquire a lock, it registers itself in the `queue`. In other words it will be added to the special `queue` and will acquire lock, because it is free for now. When the second thread wants to acquire the same lock before the first thread release it, this thread adds its own copy of the lock variable into this `queue`. In this case the first thread will contain a `next` field which will point to the second thread. From this moment, the second thread will wait until the first thread release its lock and notify `next` thread about this event. The first thread will be deleted from the `queue` and the second thread will be owner of a lock.
@ -406,7 +406,7 @@ That's all.
Conclusion Conclusion
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the end of the second part of the [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) chapter in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) we already met the first synchronization primitive `spinlock` provided by the Linux kernel which is implemented as `ticket spinlock`. In this part we saw another implementation of the `spinlock` mechanism - `queued spinlock`. In the next part we will continue to dive into synchronization primitives in the Linux kernel. This is the end of the second part of the [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) chapter in the Linux kernel. In the previous [part](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-1) we already met the first synchronization primitive `spinlock` provided by the Linux kernel which is implemented as `ticket spinlock`. In this part we saw another implementation of the `spinlock` mechanism - `queued spinlock`. In the next part we will continue to dive into synchronization primitives in the Linux kernel.
If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new). If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
@ -421,11 +421,11 @@ Links
* [API](https://en.wikipedia.org/wiki/Application_programming_interface) * [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [Test and Set](https://en.wikipedia.org/wiki/Test-and-set) * [Test and Set](https://en.wikipedia.org/wiki/Test-and-set)
* [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) * [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf)
* [per-cpu variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) * [per-cpu variables](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1)
* [atomic instruction](https://en.wikipedia.org/wiki/Linearizability) * [atomic instruction](https://en.wikipedia.org/wiki/Linearizability)
* [CMPXCHG instruction](http://x86.renejeschke.de/html/file_module_x86_id_41.html) * [CMPXCHG instruction](http://x86.renejeschke.de/html/file_module_x86_id_41.html)
* [LOCK instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html) * [LOCK instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html)
* [NOP instruction](https://en.wikipedia.org/wiki/NOP) * [NOP instruction](https://en.wikipedia.org/wiki/NOP)
* [PREFETCHW instruction](http://www.felixcloutier.com/x86/PREFETCHW.html) * [PREFETCHW instruction](http://www.felixcloutier.com/x86/PREFETCHW.html)
* [x86_64](https://en.wikipedia.org/wiki/X86-64) * [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-1)

View File

@ -4,7 +4,7 @@ Synchronization primitives in the Linux kernel. Part 3.
Semaphores Semaphores
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html) which describes synchronization primitives in the Linux kernel and in the previous part we saw special type of [spinlocks](https://en.wikipedia.org/wiki/Spinlock) - `queued spinlocks`. The previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-2.html) was the last part which describes `spinlocks` related stuff. So we need to go ahead. This is the third part of the [chapter](https://0xax.gitbook.io/linux-insides/summary/syncprim) which describes synchronization primitives in the Linux kernel and in the previous part we saw special type of [spinlocks](https://en.wikipedia.org/wiki/Spinlock) - `queued spinlocks`. The previous [part](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-2) was the last part which describes `spinlocks` related stuff. So we need to go ahead.
The next [synchronization primitive](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) after `spinlock` which we will see in this part is [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29). We will start from theoretical side and will learn what is it `semaphore` and only after this, we will see how it is implemented in the Linux kernel as we did in the previous part. The next [synchronization primitive](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) after `spinlock` which we will see in this part is [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29). We will start from theoretical side and will learn what is it `semaphore` and only after this, we will see how it is implemented in the Linux kernel as we did in the previous part.
@ -70,13 +70,13 @@ as we may see, the `DEFINE_SEMAPHORE` macro provides ability to initialize only
} }
``` ```
The `__SEMAPHORE_INITIALIZER` macro takes the name of the future `semaphore` structure and does initialization of the fields of this structure. First of all we initialize a `spinlock` of the given `semaphore` with the `__RAW_SPIN_LOCK_UNLOCKED` macro. As you may remember from the [previous](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) parts, the `__RAW_SPIN_LOCK_UNLOCKED` is defined in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/spinlock_types.h) header file and expands to the `__ARCH_SPIN_LOCK_UNLOCKED` macro which just expands to zero or unlocked state: The `__SEMAPHORE_INITIALIZER` macro takes the name of the future `semaphore` structure and does initialization of the fields of this structure. First of all we initialize a `spinlock` of the given `semaphore` with the `__RAW_SPIN_LOCK_UNLOCKED` macro. As you may remember from the [previous](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-1) parts, the `__RAW_SPIN_LOCK_UNLOCKED` is defined in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/spinlock_types.h) header file and expands to the `__ARCH_SPIN_LOCK_UNLOCKED` macro which just expands to zero or unlocked state:
```C ```C
#define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } } #define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } }
``` ```
The last two fields of the `semaphore` structure `count` and `wait_list` are initialized with the given value which represents count of available resources and empty [list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-1.html). The last two fields of the `semaphore` structure `count` and `wait_list` are initialized with the given value which represents count of available resources and empty [list](https://0xax.gitbook.io/linux-insides/summary/datastructures/linux-datastructures-1).
The second way to initialize a `semaphore` structure is to pass the `semaphore` and number of available resources to the `sema_init` function which is defined in the [include/linux/semaphore.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/semaphore.h) header file: The second way to initialize a `semaphore` structure is to pass the `semaphore` and number of available resources to the `sema_init` function which is defined in the [include/linux/semaphore.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/semaphore.h) header file:
@ -89,7 +89,7 @@ static inline void sema_init(struct semaphore *sem, int val)
} }
``` ```
Let's consider implementation of this function. It looks pretty easy and actually it does almost the same. Thus function executes initialization of the given `semaphore` with the `__SEMAPHORE_INITIALIZER` macro which we just saw. As I already wrote in the previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html), we will skip the stuff which is related to the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) of the Linux kernel. Let's consider implementation of this function. It looks pretty easy and actually it does almost the same. Thus function executes initialization of the given `semaphore` with the `__SEMAPHORE_INITIALIZER` macro which we just saw. As I already wrote in the previous parts of this [chapter](https://0xax.gitbook.io/linux-insides/summary/syncprim), we will skip the stuff which is related to the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) of the Linux kernel.
So, from now we are able to initialize a `semaphore` let's look at how to lock and unlock. The Linux kernel provides following [API](https://en.wikipedia.org/wiki/Application_programming_interface) to manipulate `semaphores`: So, from now we are able to initialize a `semaphore` let's look at how to lock and unlock. The Linux kernel provides following [API](https://en.wikipedia.org/wiki/Application_programming_interface) to manipulate `semaphores`:
@ -106,7 +106,7 @@ The first two functions: `down` and `up` are for acquiring and releasing of the
The `down_killable` function does the same as the `down_interruptible` function, but set the `TASK_KILLABLE` flag for the current process. This means that the waiting process may be interrupted by the kill signal. The `down_killable` function does the same as the `down_interruptible` function, but set the `TASK_KILLABLE` flag for the current process. This means that the waiting process may be interrupted by the kill signal.
The `down_trylock` function is similar on the `spin_trylock` function. This function tries to acquire a lock and exit if this operation was unsuccessful. In this case the process which wants to acquire a lock, will not wait. The last `down_timeout` function tries to acquire a lock. It will be interrupted in a waiting state when the given timeout will be expired. Additionally, you may notice that the timeout is in [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html) The `down_trylock` function is similar on the `spin_trylock` function. This function tries to acquire a lock and exit if this operation was unsuccessful. In this case the process which wants to acquire a lock, will not wait. The last `down_timeout` function tries to acquire a lock. It will be interrupted in a waiting state when the given timeout will be expired. Additionally, you may notice that the timeout is in [jiffies](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-1)
We just saw definitions of the `semaphore` [API](https://en.wikipedia.org/wiki/Application_programming_interface). We will start from the `down` function. This function is defined in the [kernel/locking/semaphore.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/locking/semaphore.c) source code file. Let's look on the implementation function: We just saw definitions of the `semaphore` [API](https://en.wikipedia.org/wiki/Application_programming_interface). We will start from the `down` function. This function is defined in the [kernel/locking/semaphore.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/locking/semaphore.c) source code file. Let's look on the implementation function:
@ -184,7 +184,7 @@ The first represents current task for the local processor which wants to acquire
#define current get_current() #define current get_current()
``` ```
Where the `get_current` function returns value of the `current_task` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variable: Where the `get_current` function returns value of the `current_task` [per-cpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1) variable:
```C ```C
DECLARE_PER_CPU(struct task_struct *, current_task); DECLARE_PER_CPU(struct task_struct *, current_task);
@ -342,13 +342,13 @@ Links
* [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) * [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29)
* [deadlocks](https://en.wikipedia.org/wiki/Deadlock) * [deadlocks](https://en.wikipedia.org/wiki/Deadlock)
* [scheduler](https://en.wikipedia.org/wiki/Scheduling_%28computing%29) * [scheduler](https://en.wikipedia.org/wiki/Scheduling_%28computing%29)
* [Doubly linked list in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-1.html) * [Doubly linked list in the Linux kernel](https://0xax.gitbook.io/linux-insides/summary/datastructures/linux-datastructures-1)
* [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html) * [jiffies](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-1)
* [interrupts](https://en.wikipedia.org/wiki/Interrupt) * [interrupts](https://en.wikipedia.org/wiki/Interrupt)
* [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) * [per-cpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1)
* [bitmask](https://en.wikipedia.org/wiki/Mask_%28computing%29) * [bitmask](https://en.wikipedia.org/wiki/Mask_%28computing%29)
* [SIGKILL](https://en.wikipedia.org/wiki/Unix_signal#SIGKILL) * [SIGKILL](https://en.wikipedia.org/wiki/Unix_signal#SIGKILL)
* [errno](https://en.wikipedia.org/wiki/Errno.h) * [errno](https://en.wikipedia.org/wiki/Errno.h)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface) * [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) * [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-2.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-2)

View File

@ -4,16 +4,16 @@ Synchronization primitives in the Linux kernel. Part 4.
Introduction Introduction
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the fourth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html) which describes synchronization primitives in the Linux kernel and in the previous parts we finished to consider different types [spinlocks](https://en.wikipedia.org/wiki/Spinlock) and [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) synchronization primitives. We will continue to learn [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) in this part and consider yet another one which is called - [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) which is stands for `MUTual EXclusion`. This is the fourth part of the [chapter](https://0xax.gitbook.io/linux-insides/summary/syncprim) which describes synchronization primitives in the Linux kernel and in the previous parts we finished to consider different types [spinlocks](https://en.wikipedia.org/wiki/Spinlock) and [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) synchronization primitives. We will continue to learn [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) in this part and consider yet another one which is called - [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) which is stands for `MUTual EXclusion`.
As in all previous parts of this [book](https://0xax.gitbooks.io/linux-insides/content), we will try to consider this synchronization primitive from the theoretical side and only than we will consider [API](https://en.wikipedia.org/wiki/Application_programming_interface) provided by the Linux kernel to manipulate with `mutexes`. As in all previous parts of this [book](https://github.com/0xAX/linux-insides/blob/master/SUMMARY.md), we will try to consider this synchronization primitive from the theoretical side and only than we will consider [API](https://en.wikipedia.org/wiki/Application_programming_interface) provided by the Linux kernel to manipulate with `mutexes`.
So, let's start. So, let's start.
Concept of `mutex` Concept of `mutex`
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
We already familiar with the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) synchronization primitive from the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-3.html). It represented by the: We already familiar with the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) synchronization primitive from the previous [part](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-3). It represented by the:
```C ```C
struct semaphore { struct semaphore {
@ -77,7 +77,7 @@ struct mutex_waiter {
}; };
``` ```
structure from the [include/linux/mutex.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/mutex.h) header file and will be sleep. Before we will consider [API](https://en.wikipedia.org/wiki/Application_programming_interface) which is provided by the Linux kernel for manipulation with `mutexes`, let's consider the `mutex_waiter` structure. If you have read the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-3.html) of this chapter, you may notice that the `mutex_waiter` structure is similar to the `semaphore_waiter` structure from the [kernel/locking/semaphore.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/locking/semaphore.c) source code file: structure from the [include/linux/mutex.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/mutex.h) header file and will be sleep. Before we will consider [API](https://en.wikipedia.org/wiki/Application_programming_interface) which is provided by the Linux kernel for manipulation with `mutexes`, let's consider the `mutex_waiter` structure. If you have read the [previous part](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-3) of this chapter, you may notice that the `mutex_waiter` structure is similar to the `semaphore_waiter` structure from the [kernel/locking/semaphore.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/locking/semaphore.c) source code file:
```C ```C
struct semaphore_waiter { struct semaphore_waiter {
@ -114,7 +114,7 @@ macro. Let's consider implementation of this macro. As we may see, the `DEFINE_M
} }
``` ```
This macro is defined in the [same](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/mutex.h) header file and as we may understand it initializes fields of the `mutex` structure the initial values. The `count` field get initialized with the `1` which represents `unlocked` state of a mutex. The `wait_lock` [spinlock](https://en.wikipedia.org/wiki/Spinlock) get initialized to the unlocked state and the last field `wait_list` to empty [doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-1.html). This macro is defined in the [same](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/mutex.h) header file and as we may understand it initializes fields of the `mutex` structure the initial values. The `count` field get initialized with the `1` which represents `unlocked` state of a mutex. The `wait_lock` [spinlock](https://en.wikipedia.org/wiki/Spinlock) get initialized to the unlocked state and the last field `wait_list` to empty [doubly linked list](https://0xax.gitbook.io/linux-insides/summary/datastructures/linux-datastructures-1).
The second approach allows us to initialize a `mutex` dynamically. To do this we need to call the `__mutex_init` function from the [kernel/locking/mutex.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/locking/mutex.c) source code file. Actually, the `__mutex_init` function rarely called directly. Instead of the `__mutex_init`, the: The second approach allows us to initialize a `mutex` dynamically. To do this we need to call the `__mutex_init` function from the [kernel/locking/mutex.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/locking/mutex.c) source code file. Actually, the `__mutex_init` function rarely called directly. Instead of the `__mutex_init`, the:
@ -159,7 +159,7 @@ static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
} }
``` ```
In the end of the `__mutex_init` function we may see the call of the `debug_mutex_init` function, but as I already wrote in previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html), we will not consider debugging related stuff in this chapter. In the end of the `__mutex_init` function we may see the call of the `debug_mutex_init` function, but as I already wrote in previous parts of this [chapter](https://0xax.gitbook.io/linux-insides/summary/syncprim), we will not consider debugging related stuff in this chapter.
After the `mutex` structure is initialized, we may go ahead and will look at the `lock` and `unlock` API of `mutex` synchronization primitive. Implementation of `mutex_lock` and `mutex_unlock` functions located in the [kernel/locking/mutex.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/locking/mutex.c) source code file. First of all let's start from the implementation of the `mutex_lock`. It looks: After the `mutex` structure is initialized, we may go ahead and will look at the `lock` and `unlock` API of `mutex` synchronization primitive. Implementation of `mutex_lock` and `mutex_unlock` functions located in the [kernel/locking/mutex.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/locking/mutex.c) source code file. First of all let's start from the implementation of the `mutex_lock`. It looks:
@ -176,7 +176,7 @@ We may see the call of the `might_sleep` macro from the [include/linux/kernel.h]
After the `might_sleep` macro, we may see the call of the `__mutex_fastpath_lock` function. This function is architecture-specific and as we consider [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture in this book, the implementation of the `__mutex_fastpath_lock` is located in the [arch/x86/include/asm/mutex_64.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/mutex_64.h) header file. As we may understand from the name of the `__mutex_fastpath_lock` function, this function will try to acquire lock in a fast path or in other words this function will try to decrement the value of the `count` of the given mutex. After the `might_sleep` macro, we may see the call of the `__mutex_fastpath_lock` function. This function is architecture-specific and as we consider [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture in this book, the implementation of the `__mutex_fastpath_lock` is located in the [arch/x86/include/asm/mutex_64.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/include/asm/mutex_64.h) header file. As we may understand from the name of the `__mutex_fastpath_lock` function, this function will try to acquire lock in a fast path or in other words this function will try to decrement the value of the `count` of the given mutex.
Implementation of the `__mutex_fastpath_lock` function consists from two parts. The first part is [inline assembly](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-3.html) statement. Let's look at it: Implementation of the `__mutex_fastpath_lock` function consists from two parts. The first part is [inline assembly](https://0xax.gitbook.io/linux-insides/summary/theory/linux-theory-3) statement. Let's look at it:
```C ```C
asm_volatile_goto(LOCK_PREFIX " decl %0\n" asm_volatile_goto(LOCK_PREFIX " decl %0\n"
@ -403,7 +403,7 @@ That's all. We have considered main `API` for manipulation with `mutexes`: `mute
* `mutex_lock_killable`; * `mutex_lock_killable`;
* `mutex_trylock`. * `mutex_trylock`.
and corresponding versions of `unlock` prefixed functions. This part will not describe this `API`, because it is similar to corresponding `API` of `semaphores`. More about it you may read in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-3.html). and corresponding versions of `unlock` prefixed functions. This part will not describe this `API`, because it is similar to corresponding `API` of `semaphores`. More about it you may read in the [previous part](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-3).
That's all. That's all.
@ -429,12 +429,12 @@ Links
* [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) * [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
* [Atomic](https://en.wikipedia.org/wiki/Linearizability) * [Atomic](https://en.wikipedia.org/wiki/Linearizability)
* [MCS lock](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) * [MCS lock](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf)
* [Doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-1.html) * [Doubly linked list](https://0xax.gitbook.io/linux-insides/summary/datastructures/linux-datastructures-1)
* [x86_64](https://en.wikipedia.org/wiki/X86-64) * [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [Inline assembly](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-3.html) * [Inline assembly](https://0xax.gitbook.io/linux-insides/summary/theory/linux-theory-3)
* [Memory barrier](https://en.wikipedia.org/wiki/Memory_barrier) * [Memory barrier](https://en.wikipedia.org/wiki/Memory_barrier)
* [Lock instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html) * [Lock instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html)
* [JNS instruction](http://unixwiz.net/techtips/x86-jumps.html) * [JNS instruction](http://unixwiz.net/techtips/x86-jumps.html)
* [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) * [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29)
* [Unix signals](https://en.wikipedia.org/wiki/Unix_signal) * [Unix signals](https://en.wikipedia.org/wiki/Unix_signal)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-3.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-3)

View File

@ -4,9 +4,9 @@ Synchronization primitives in the Linux kernel. Part 5.
Introduction Introduction
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the fifth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html) which describes synchronization primitives in the Linux kernel and in the previous parts we finished to consider different types [spinlocks](https://en.wikipedia.org/wiki/Spinlock), [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) and [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) synchronization primitives. We will continue to learn [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) in this part and start to consider special type of synchronization primitives - [readerswriter lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock). This is the fifth part of the [chapter](https://0xax.gitbook.io/linux-insides/summary/syncprim) which describes synchronization primitives in the Linux kernel and in the previous parts we finished to consider different types [spinlocks](https://en.wikipedia.org/wiki/Spinlock), [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) and [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) synchronization primitives. We will continue to learn [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) in this part and start to consider special type of synchronization primitives - [readerswriter lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock).
The first synchronization primitive of this type will be already familiar for us - [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29). As in all previous parts of this [book](https://0xax.gitbooks.io/linux-insides/content), before we will consider implementation of the `reader/writer semaphores` in the Linux kernel, we will start from the theoretical side and will try to understand what is the difference between `reader/writer semaphores` and `normal semaphores`. The first synchronization primitive of this type will be already familiar for us - [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29). As in all previous parts of this [book](https://github.com/0xAX/linux-insides/blob/master/SUMMARY.md), before we will consider implementation of the `reader/writer semaphores` in the Linux kernel, we will start from the theoretical side and will try to understand what is the difference between `reader/writer semaphores` and `normal semaphores`.
So, let's start. So, let's start.
@ -15,7 +15,7 @@ Reader/Writer semaphore
Actually there are two types of operations may be performed on the data. We may read data and make changes in data. Two fundamental operations - `read` and `write`. Usually (but not always), `read` operation is performed more often than `write` operation. In this case, it would be logical to we may lock data in such way, that some processes may read locked data in one time, on condition that no one will not change the data. The [readers/writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) allows us to get this lock. Actually there are two types of operations may be performed on the data. We may read data and make changes in data. Two fundamental operations - `read` and `write`. Usually (but not always), `read` operation is performed more often than `write` operation. In this case, it would be logical to we may lock data in such way, that some processes may read locked data in one time, on condition that no one will not change the data. The [readers/writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) allows us to get this lock.
When a process which wants to write something into data, all other `writer` and `reader` processes will be blocked until the process which acquired a lock, will not release it. When a process reads data, other processes which want to read the same data too, will not be locked and will be able to do this. As you may guess, implementation of the `reader/writer semaphore` is based on the implementation of the `normal semaphore`. We already familiar with the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) synchronization primitive from the third [part]((https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-4.html) of this chapter. From the theoretical side everything looks pretty simple. Let's look how `reader/writer semaphore` is represented in the Linux kernel. When a process which wants to write something into data, all other `writer` and `reader` processes will be blocked until the process which acquired a lock, will not release it. When a process reads data, other processes which want to read the same data too, will not be locked and will be able to do this. As you may guess, implementation of the `reader/writer semaphore` is based on the implementation of the `normal semaphore`. We already familiar with the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) synchronization primitive from the third [part]((https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-4) of this chapter. From the theoretical side everything looks pretty simple. Let's look how `reader/writer semaphore` is represented in the Linux kernel.
The `semaphore` is represented by the: The `semaphore` is represented by the:
@ -57,9 +57,9 @@ config RWSEM_GENERIC_SPINLOCK
def_bool !RWSEM_XCHGADD_ALGORITHM def_bool !RWSEM_XCHGADD_ALGORITHM
``` ```
So, as this [book](https://0xax.gitbooks.io/linux-insides/content) describes only [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture related stuff, we will skip the case when the `CONFIG_RWSEM_GENERIC_SPINLOCK` kernel configuration is enabled and consider definition of the `rw_semaphore` structure only from the [include/linux/rwsem.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/rwsem.h) header file. So, as this [book](https://github.com/0xAX/linux-insides/blob/master/SUMMARY.md) describes only [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture related stuff, we will skip the case when the `CONFIG_RWSEM_GENERIC_SPINLOCK` kernel configuration is enabled and consider definition of the `rw_semaphore` structure only from the [include/linux/rwsem.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/rwsem.h) header file.
If we will take a look at the definition of the `rw_semaphore` structure, we will notice that first three fields are the same that in the `semaphore` structure. It contains `count` field which represents amount of available resources, the `wait_list` field which represents [doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-1.html) of processes which are waiting to acquire a lock and `wait_lock` [spinlock](https://en.wikipedia.org/wiki/Spinlock) for protection of this list. Notice that `rw_semaphore.count` field is `long` type unlike the same field in the `semaphore` structure. If we will take a look at the definition of the `rw_semaphore` structure, we will notice that first three fields are the same that in the `semaphore` structure. It contains `count` field which represents amount of available resources, the `wait_list` field which represents [doubly linked list](https://0xax.gitbook.io/linux-insides/summary/datastructures/linux-datastructures-1) of processes which are waiting to acquire a lock and `wait_lock` [spinlock](https://en.wikipedia.org/wiki/Spinlock) for protection of this list. Notice that `rw_semaphore.count` field is `long` type unlike the same field in the `semaphore` structure.
The `count` field of a `rw_semaphore` structure may have following values: The `count` field of a `rw_semaphore` structure may have following values:
@ -70,7 +70,7 @@ The `count` field of a `rw_semaphore` structure may have following values:
* `0xffffffff00000000` - represents situation when there are readers or writers are queued, but no one is active or is in the process of acquire of a lock; * `0xffffffff00000000` - represents situation when there are readers or writers are queued, but no one is active or is in the process of acquire of a lock;
* `0xfffffffe00000001` - a writer is active or attempting to acquire a lock and waiters are in queue. * `0xfffffffe00000001` - a writer is active or attempting to acquire a lock and waiters are in queue.
So, besides the `count` field, all of these fields are similar to fields of the `semaphore` structure. Last three fields depend on the two configuration options of the Linux kernel: the `CONFIG_RWSEM_SPIN_ON_OWNER` and `CONFIG_DEBUG_LOCK_ALLOC`. The first two fields may be familiar us by declaration of the [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) structure from the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-4.html). The first `osq` field represents [MCS lock](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) spinner for `optimistic spinning` and the second represents process which is current owner of a lock. So, besides the `count` field, all of these fields are similar to fields of the `semaphore` structure. Last three fields depend on the two configuration options of the Linux kernel: the `CONFIG_RWSEM_SPIN_ON_OWNER` and `CONFIG_DEBUG_LOCK_ALLOC`. The first two fields may be familiar us by declaration of the [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) structure from the [previous part](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-4). The first `osq` field represents [MCS lock](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) spinner for `optimistic spinning` and the second represents process which is current owner of a lock.
The last field of the `rw_semaphore` structure is - `dep_map` - debugging related, and as I already wrote in previous parts, we will skip debugging related stuff in this chapter. The last field of the `rw_semaphore` structure is - `dep_map` - debugging related, and as I already wrote in previous parts, we will skip debugging related stuff in this chapter.
@ -81,7 +81,7 @@ Reader/Writer semaphore API
So, we know a little about `reader/writer semaphores` from theoretical side, let's look on its implementation in the Linux kernel. All `reader/writer semaphores` related [API](https://en.wikipedia.org/wiki/Application_programming_interface) is located in the [include/linux/rwsem.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/rwsem.h) header file. So, we know a little about `reader/writer semaphores` from theoretical side, let's look on its implementation in the Linux kernel. All `reader/writer semaphores` related [API](https://en.wikipedia.org/wiki/Application_programming_interface) is located in the [include/linux/rwsem.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/rwsem.h) header file.
As always Before we will consider an [API](https://en.wikipedia.org/wiki/Application_programming_interface) of the `reader/writer semaphore` mechanism in the Linux kernel, we need to know how to initialize the `rw_semaphore` structure. As we already saw in previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html), all [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) may be initialized in two ways: As always Before we will consider an [API](https://en.wikipedia.org/wiki/Application_programming_interface) of the `reader/writer semaphore` mechanism in the Linux kernel, we need to know how to initialize the `rw_semaphore` structure. As we already saw in previous parts of this [chapter](https://0xax.gitbook.io/linux-insides/summary/syncprim), all [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) may be initialized in two ways:
* `statically`; * `statically`;
* `dynamically`. * `dynamically`.
@ -193,7 +193,7 @@ void __sched down_write(struct rw_semaphore *sem)
} }
``` ```
We already met the `might_sleep` macro in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-4.html). In short words, Implementation of the `might_sleep` macro depends on the `CONFIG_DEBUG_ATOMIC_SLEEP` kernel configuration option and if this option is enabled, this macro just prints a stack trace if it was executed in [atomic](https://en.wikipedia.org/wiki/Linearizability) context. As this macro is mostly for debugging purpose we will skip it and will go ahead. Additionally we will skip the next macro from the `down_read` function - `rwsem_acquire` which is related to the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) of the Linux kernel, because this is topic of other part. We already met the `might_sleep` macro in the [previous part](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-4). In short words, Implementation of the `might_sleep` macro depends on the `CONFIG_DEBUG_ATOMIC_SLEEP` kernel configuration option and if this option is enabled, this macro just prints a stack trace if it was executed in [atomic](https://en.wikipedia.org/wiki/Linearizability) context. As this macro is mostly for debugging purpose we will skip it and will go ahead. Additionally we will skip the next macro from the `down_read` function - `rwsem_acquire` which is related to the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) of the Linux kernel, because this is topic of other part.
The only two things that remained in the `down_write` function is the call of the `LOCK_CONTENDED` macro which is defined in the [include/linux/lockdep.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/lockdep.h) header file and setting of owner of a lock with the `rwsem_set_owner` function which sets owner to currently running process: The only two things that remained in the `down_write` function is the call of the `LOCK_CONTENDED` macro which is defined in the [include/linux/lockdep.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/lockdep.h) header file and setting of owner of a lock with the `rwsem_set_owner` function which sets owner to currently running process:
@ -240,7 +240,7 @@ static inline void __down_write_nested(struct rw_semaphore *sem, int subclass)
} }
``` ```
As for other synchronization primitives which we saw in this chapter, usually `lock/unlock` functions consists only from an [inline assembly](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-3.html) statement. As we may see, in our case the same for `__down_write_nested` function. Let's try to understand what does this function do. The first line of our assembly statement is just a comment, let's skip it. The second like contains `LOCK_PREFIX` which will be expanded to the [LOCK](http://x86.renejeschke.de/html/file_module_x86_id_159.html) instruction as we already know. The next [xadd](http://x86.renejeschke.de/html/file_module_x86_id_327.html) instruction executes `add` and `exchange` operations. In other words, `xadd` instruction adds value of the `RWSEM_ACTIVE_WRITE_BIAS`: As for other synchronization primitives which we saw in this chapter, usually `lock/unlock` functions consists only from an [inline assembly](https://0xax.gitbook.io/linux-insides/summary/theory/linux-theory-3) statement. As we may see, in our case the same for `__down_write_nested` function. Let's try to understand what does this function do. The first line of our assembly statement is just a comment, let's skip it. The second like contains `LOCK_PREFIX` which will be expanded to the [LOCK](http://x86.renejeschke.de/html/file_module_x86_id_159.html) instruction as we already know. The next [xadd](http://x86.renejeschke.de/html/file_module_x86_id_327.html) instruction executes `add` and `exchange` operations. In other words, `xadd` instruction adds value of the `RWSEM_ACTIVE_WRITE_BIAS`:
```C ```C
#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS) #define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
@ -292,7 +292,7 @@ if (rwsem_optimistic_spin(sem))
return sem; return sem;
``` ```
We will skip implementation of the `rwsem_optimistic_spin` function, as it is similar on the `mutex_optimistic_spin` function which we saw in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-4.html). In short words we check existence other tasks ready to run that have higher priority in the `rwsem_optimistic_spin` function. If there are such tasks, the process will be added to the [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) `waitqueue` and start to spin in the loop until a lock will be able to be acquired. If `optimistic spinning` is disabled, a process will be added to the and marked as waiting for write: We will skip implementation of the `rwsem_optimistic_spin` function, as it is similar on the `mutex_optimistic_spin` function which we saw in the [previous part](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-4). In short words we check existence other tasks ready to run that have higher priority in the `rwsem_optimistic_spin` function. If there are such tasks, the process will be added to the [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) `waitqueue` and start to spin in the loop until a lock will be able to be acquired. If `optimistic spinning` is disabled, a process will be added to the and marked as waiting for write:
```C ```C
waiter.task = current; waiter.task = current;
@ -325,7 +325,7 @@ while (true) {
} }
``` ```
I will skip explanation of this loop as we already met similar functional in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-4.html). I will skip explanation of this loop as we already met similar functional in the [previous part](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-4).
That's all. From this moment, our `writer` process will acquire or not acquire a lock depends on the value of the `rw_semaphore->count` field. Now if we will look at the implementation of the `down_read` function which executes a try of acquiring of a lock. We will see similar actions which we saw in the `down_write` function. This function calls different debugging and lock validator related functions/macros: That's all. From this moment, our `writer` process will acquire or not acquire a lock depends on the value of the `rw_semaphore->count` field. Now if we will look at the implementation of the `down_read` function which executes a try of acquiring of a lock. We will see similar actions which we saw in the `down_write` function. This function calls different debugging and lock validator related functions/macros:
@ -422,12 +422,12 @@ Links
* [Semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) * [Semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29)
* [Mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) * [Mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
* [x86_64 architecture](https://en.wikipedia.org/wiki/X86-64) * [x86_64 architecture](https://en.wikipedia.org/wiki/X86-64)
* [Doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/linux-datastructures-1.html) * [Doubly linked list](https://0xax.gitbook.io/linux-insides/summary/datastructures/linux-datastructures-1)
* [MCS lock](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) * [MCS lock](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface) * [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [Linux kernel lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) * [Linux kernel lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
* [Atomic operations](https://en.wikipedia.org/wiki/Linearizability) * [Atomic operations](https://en.wikipedia.org/wiki/Linearizability)
* [Inline assembly](https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-3.html) * [Inline assembly](https://0xax.gitbook.io/linux-insides/summary/theory/linux-theory-3)
* [XADD instruction](http://x86.renejeschke.de/html/file_module_x86_id_327.html) * [XADD instruction](http://x86.renejeschke.de/html/file_module_x86_id_327.html)
* [LOCK instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html) * [LOCK instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-4.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-4)

View File

@ -6,20 +6,20 @@ Introduction
This is the sixth part of the chapter which describes [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_\(computer_science\)) in the Linux kernel and in the previous parts we finished to consider different [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) synchronization primitives. We will continue to learn synchronization primitives in this part and start to consider a similar synchronization primitive which can be used to avoid the `writer starvation` problem. The name of this synchronization primitive is - `seqlock` or `sequential locks`. This is the sixth part of the chapter which describes [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_\(computer_science\)) in the Linux kernel and in the previous parts we finished to consider different [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) synchronization primitives. We will continue to learn synchronization primitives in this part and start to consider a similar synchronization primitive which can be used to avoid the `writer starvation` problem. The name of this synchronization primitive is - `seqlock` or `sequential locks`.
We know from the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-5.html) that [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) is a special lock mechanism which allows concurrent access for read-only operations, but an exclusive lock is needed for writing or modifying data. As we may guess, it may lead to a problem which is called `writer starvation`. In other words, a writer process can't acquire a lock as long as at least one reader process which acquired a lock holds it. So, in the situation when contention is high, it will lead to situation when a writer process which wants to acquire a lock will wait for it for a long time. We know from the previous [part](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-5) that [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) is a special lock mechanism which allows concurrent access for read-only operations, but an exclusive lock is needed for writing or modifying data. As we may guess, it may lead to a problem which is called `writer starvation`. In other words, a writer process can't acquire a lock as long as at least one reader process which acquired a lock holds it. So, in the situation when contention is high, it will lead to situation when a writer process which wants to acquire a lock will wait for it for a long time.
The `seqlock` synchronization primitive can help solve this problem. The `seqlock` synchronization primitive can help solve this problem.
As in all previous parts of this [book](https://0xax.gitbooks.io/linux-insides/content), we will try to consider this synchronization primitive from the theoretical side and only than we will consider [API](https://en.wikipedia.org/wiki/Application_programming_interface) provided by the Linux kernel to manipulate with `seqlocks`. As in all previous parts of this [book](https://github.com/0xAX/linux-insides/blob/master/SUMMARY.md), we will try to consider this synchronization primitive from the theoretical side and only than we will consider [API](https://en.wikipedia.org/wiki/Application_programming_interface) provided by the Linux kernel to manipulate with `seqlocks`.
So, let's start. So, let's start.
Sequential lock Sequential lock
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
So, what is a `seqlock` synchronization primitive and how does it work? Let's try to answer on these questions in this paragraph. Actually `sequential locks` were introduced in the Linux kernel 2.6.x. Main point of this synchronization primitive is to provide fast and lock-free access to shared resources. Since the heart of `sequential lock` synchronization primitive is [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) synchronization primitive, `sequential locks` work in situations where the protected resources are small and simple. Additionally write access must be rare and also should be fast. So, what is a `seqlock` synchronization primitive and how does it work? Let's try to answer on these questions in this paragraph. Actually `sequential locks` were introduced in the Linux kernel 2.6.x. Main point of this synchronization primitive is to provide fast and lock-free access to shared resources. Since the heart of `sequential lock` synchronization primitive is [spinlock](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-1) synchronization primitive, `sequential locks` work in situations where the protected resources are small and simple. Additionally write access must be rare and also should be fast.
Work of this synchronization primitive is based on the sequence of events counter. Actually a `sequential lock` allows free access to a resource for readers, but each reader must check existence of conflicts with a writer. This synchronization primitive introduces a special counter. The main algorithm of work of `sequential locks` is simple: Each writer which acquired a sequential lock increments this counter and additionally acquires a [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html). When this writer finishes, it will release the acquired spinlock to give access to other writers and increment the counter of a sequential lock again. Work of this synchronization primitive is based on the sequence of events counter. Actually a `sequential lock` allows free access to a resource for readers, but each reader must check existence of conflicts with a writer. This synchronization primitive introduces a special counter. The main algorithm of work of `sequential locks` is simple: Each writer which acquired a sequential lock increments this counter and additionally acquires a [spinlock](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-1). When this writer finishes, it will release the acquired spinlock to give access to other writers and increment the counter of a sequential lock again.
Read only access works on the following principle, it gets the value of a `sequential lock` counter before it will enter into [critical section](https://en.wikipedia.org/wiki/Critical_section) and compares it with the value of the same `sequential lock` counter at the exit of critical section. If their values are equal, this means that there weren't writers for this period. If their values are not equal, this means that a writer has incremented the counter during the [critical section](https://en.wikipedia.org/wiki/Critical_section). This conflict means that reading of protected data must be repeated. Read only access works on the following principle, it gets the value of a `sequential lock` counter before it will enter into [critical section](https://en.wikipedia.org/wiki/Critical_section) and compares it with the value of the same `sequential lock` counter at the exit of critical section. If their values are equal, this means that there weren't writers for this period. If their values are not equal, this means that a writer has incremented the counter during the [critical section](https://en.wikipedia.org/wiki/Critical_section). This conflict means that reading of protected data must be repeated.
@ -54,7 +54,7 @@ typedef struct {
} seqlock_t; } seqlock_t;
``` ```
As we may see the `seqlock_t` provides two fields. These fields represent a sequential lock counter, description of which we saw above and also a [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) which will protect data from other writers. Note that the `seqcount` counter represented as `seqcount` type. The `seqcount` is structure: As we may see the `seqlock_t` provides two fields. These fields represent a sequential lock counter, description of which we saw above and also a [spinlock](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-1) which will protect data from other writers. Note that the `seqcount` counter represented as `seqcount` type. The `seqcount` is structure:
```C ```C
typedef struct seqcount { typedef struct seqcount {
@ -67,7 +67,7 @@ typedef struct seqcount {
which holds counter of a sequential lock and [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related field. which holds counter of a sequential lock and [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related field.
As always in previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/), before we will consider an [API](https://en.wikipedia.org/wiki/Application_programming_interface) of `sequential lock` mechanism in the Linux kernel, we need to know how to initialize an instance of `seqlock_t`. As always in previous parts of this [chapter](https://0xax.gitbook.io/linux-insides/summary/syncprim), before we will consider an [API](https://en.wikipedia.org/wiki/Application_programming_interface) of `sequential lock` mechanism in the Linux kernel, we need to know how to initialize an instance of `seqlock_t`.
We saw in the previous parts that often the Linux kernel provides two approaches to execute initialization of the given synchronization primitive. The same situation with the `seqlock_t` structure. These approaches allows to initialize a `seqlock_t` in two following: We saw in the previous parts that often the Linux kernel provides two approaches to execute initialization of the given synchronization primitive. The same situation with the `seqlock_t` structure. These approaches allows to initialize a `seqlock_t` in two following:
@ -114,7 +114,7 @@ So we just initialize counter of the given sequential lock to zero and additiona
#endif #endif
``` ```
As I already wrote in previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/) we will not consider [debugging](https://en.wikipedia.org/wiki/Debugging) and [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related stuff in this part. So for now we just skip the `SEQCOUNT_DEP_MAP_INIT` macro. The second field of the given `seqlock_t` is `lock` initialized with the `__SPIN_LOCK_UNLOCKED` macro which is defined in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/spinlock_types.h) header file. We will not consider implementation of this macro here as it just initialize [rawspinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) with architecture-specific methods (More abot spinlocks you may read in first parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/)). As I already wrote in previous parts of this [chapter](https://0xax.gitbook.io/linux-insides/summary/syncprim) we will not consider [debugging](https://en.wikipedia.org/wiki/Debugging) and [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related stuff in this part. So for now we just skip the `SEQCOUNT_DEP_MAP_INIT` macro. The second field of the given `seqlock_t` is `lock` initialized with the `__SPIN_LOCK_UNLOCKED` macro which is defined in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/spinlock_types.h) header file. We will not consider implementation of this macro here as it just initialize [rawspinlock](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-1) with architecture-specific methods (More abot spinlocks you may read in first parts of this [chapter](https://0xax.gitbook.io/linux-insides/summary/syncprim)).
We have considered the first way to initialize a sequential lock. Let's consider second way to do the same, but do it dynamically. We can initialize a sequential lock with the `seqlock_init` macro which is defined in the same [include/linux/seqlock.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/seqlock.h) header file. We have considered the first way to initialize a sequential lock. Let's consider second way to do the same, but do it dynamically. We can initialize a sequential lock with the `seqlock_init` macro which is defined in the same [include/linux/seqlock.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/seqlock.h) header file.
@ -149,7 +149,7 @@ static inline void __seqcount_init(seqcount_t *s, const char *name,
} }
``` ```
just initializes counter of the given `seqcount_t` with zero. The second call from the `seqlock_init` macro is the call of the `spin_lock_init` macro which we saw in the [first part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) of this chapter. just initializes counter of the given `seqcount_t` with zero. The second call from the `seqlock_init` macro is the call of the `spin_lock_init` macro which we saw in the [first part](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-1) of this chapter.
So, now we know how to initialize a `sequential lock`, now let's look at how to use it. The Linux kernel provides following [API](https://en.wikipedia.org/wiki/Application_programming_interface) to manipulate `sequential locks`: So, now we know how to initialize a `sequential lock`, now let's look at how to use it. The Linux kernel provides following [API](https://en.wikipedia.org/wiki/Application_programming_interface) to manipulate `sequential locks`:
@ -223,7 +223,7 @@ static inline int __read_seqcount_retry(const seqcount_t *s, unsigned start)
which just compares value of the counter of the given `sequential lock` with the initial value of this counter. If the initial value of the counter which is obtained from `read_seqbegin()` function is odd, this means that a writer was in the middle of updating the data when our reader began to act. In this case the value of the data can be in inconsistent state, so we need to try to read it again. which just compares value of the counter of the given `sequential lock` with the initial value of this counter. If the initial value of the counter which is obtained from `read_seqbegin()` function is odd, this means that a writer was in the middle of updating the data when our reader began to act. In this case the value of the data can be in inconsistent state, so we need to try to read it again.
This is a common pattern in the Linux kernel. For example, you may remember the `jiffies` concept from the [first part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html) of the [timers and time management in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/) chapter. The sequential lock is used to obtain value of `jiffies` at [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture: This is a common pattern in the Linux kernel. For example, you may remember the `jiffies` concept from the [first part](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-1) of the [timers and time management in the Linux kernel](https://0xax.gitbook.io/linux-insides/summary/timers/) chapter. The sequential lock is used to obtain value of `jiffies` at [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture:
```C ```C
u64 get_jiffies_64(void) u64 get_jiffies_64(void)
@ -303,7 +303,7 @@ static inline void raw_write_seqcount_end(seqcount_t *s)
and in the end we just call the `spin_unlock` macro to give access for other readers or writers. and in the end we just call the `spin_unlock` macro to give access for other readers or writers.
That's all about `sequential lock` mechanism in the Linux kernel. Of course we did not consider full [API](https://en.wikipedia.org/wiki/Application_programming_interface) of this mechanism in this part. But all other functions are based on these which we described here. For example, Linux kernel also provides some safe macros/functions to use `sequential lock` mechanism in [interrupt handlers](https://en.wikipedia.org/wiki/Interrupt_handler) of [softirq](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-9.html): `write_seqclock_irq` and `write_sequnlock_irq`: That's all about `sequential lock` mechanism in the Linux kernel. Of course we did not consider full [API](https://en.wikipedia.org/wiki/Application_programming_interface) of this mechanism in this part. But all other functions are based on these which we described here. For example, Linux kernel also provides some safe macros/functions to use `sequential lock` mechanism in [interrupt handlers](https://en.wikipedia.org/wiki/Interrupt_handler) of [softirq](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-9): `write_seqclock_irq` and `write_sequnlock_irq`:
```C ```C
static inline void write_seqlock_irq(seqlock_t *sl) static inline void write_seqlock_irq(seqlock_t *sl)
@ -339,14 +339,14 @@ Links
* [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_\(computer_science\)) * [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_\(computer_science\))
* [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) * [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock)
* [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html) * [spinlock](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-1)
* [critical section](https://en.wikipedia.org/wiki/Critical_section) * [critical section](https://en.wikipedia.org/wiki/Critical_section)
* [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) * [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
* [debugging](https://en.wikipedia.org/wiki/Debugging) * [debugging](https://en.wikipedia.org/wiki/Debugging)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface) * [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [x86_64](https://en.wikipedia.org/wiki/X86-64) * [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [Timers and time management in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/) * [Timers and time management in the Linux kernel](https://0xax.gitbook.io/linux-insides/summary/timers/)
* [interrupt handlers](https://en.wikipedia.org/wiki/Interrupt_handler) * [interrupt handlers](https://en.wikipedia.org/wiki/Interrupt_handler)
* [softirq](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-9.html) * [softirq](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-9)
* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_\(PC_architecture\)) * [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_\(PC_architecture\))
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-5.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/syncprim/linux-sync-5)

View File

@ -4,7 +4,7 @@ System calls in the Linux kernel. Part 1.
Introduction Introduction
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This post opens up a new chapter in [linux-insides](https://0xax.gitbooks.io/linux-insides/content/) book, and as you may understand from the title, this chapter will be devoted to the [System call](https://en.wikipedia.org/wiki/System_call) concept in the Linux kernel. The choice of topic for this chapter is not accidental. In the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) we saw interrupts and interrupt handling. The concept of system calls is very similar to that of interrupts. This is because the most common way to implement system calls is as software interrupts. We will see many different aspects that are related to the system call concept. For example, we will learn what's happening when a system call occurs from userspace. We will see an implementation of a couple system call handlers in the Linux kernel, [VDSO](https://en.wikipedia.org/wiki/VDSO) and [vsyscall](https://lwn.net/Articles/446528/) concepts and many many more. This post opens up a new chapter in [linux-insides](https://github.com/0xAX/linux-insides/blob/master/SUMMARY.md) book, and as you may understand from the title, this chapter will be devoted to the [System call](https://en.wikipedia.org/wiki/System_call) concept in the Linux kernel. The choice of topic for this chapter is not accidental. In the previous [chapter](https://0xax.gitbook.io/linux-insides/summary/interrupts) we saw interrupts and interrupt handling. The concept of system calls is very similar to that of interrupts. This is because the most common way to implement system calls is as software interrupts. We will see many different aspects that are related to the system call concept. For example, we will learn what's happening when a system call occurs from userspace. We will see an implementation of a couple system call handlers in the Linux kernel, [VDSO](https://en.wikipedia.org/wiki/VDSO) and [vsyscall](https://lwn.net/Articles/446528/) concepts and many many more.
Before we dive into Linux system call implementation, it is good to know some theory about system calls. Let's do it in the following paragraph. Before we dive into Linux system call implementation, it is good to know some theory about system calls. Let's do it in the following paragraph.
@ -416,4 +416,4 @@ Links
* [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system) * [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system)
* [systemd](https://en.wikipedia.org/wiki/Systemd) * [systemd](https://en.wikipedia.org/wiki/Systemd)
* [epoll](https://en.wikipedia.org/wiki/Epoll) * [epoll](https://en.wikipedia.org/wiki/Epoll)
* [Previous chapter](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) * [Previous chapter](https://0xax.gitbook.io/linux-insides/summary/interrupts)

View File

@ -4,7 +4,7 @@ System calls in the Linux kernel. Part 2.
How does the Linux kernel handle a system call How does the Linux kernel handle a system call
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
The previous [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-1.html) was the first part of the chapter that describes the [system call](https://en.wikipedia.org/wiki/System_call) concepts in the Linux kernel. The previous [part](https://0xax.gitbook.io/linux-insides/summary/syscall/linux-syscall-1) was the first part of the chapter that describes the [system call](https://en.wikipedia.org/wiki/System_call) concepts in the Linux kernel.
In the previous part we learned what a system call is in the Linux kernel, and in operating systems in general. This was introduced from a user-space perspective, and part of the [write](http://man7.org/linux/man-pages/man2/write.2.html) system call implementation was discussed. In this part we continue our look at system calls, starting with some theory before moving onto the Linux kernel code. In the previous part we learned what a system call is in the Linux kernel, and in operating systems in general. This was introduced from a user-space perspective, and part of the [write](http://man7.org/linux/man-pages/man2/write.2.html) system call implementation was discussed. In this part we continue our look at system calls, starting with some theory before moving onto the Linux kernel code.
A user application does not make the system call directly from our applications. We did not write the `Hello world!` program like: A user application does not make the system call directly from our applications. We did not write the `Hello world!` program like:
@ -114,7 +114,7 @@ asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
After this all elements that point to the non-implemented system calls will contain the address of the `sys_ni_syscall` function that just returns `-ENOSYS` as we saw above, and other elements will point to the `sys_syscall_name` functions. After this all elements that point to the non-implemented system calls will contain the address of the `sys_ni_syscall` function that just returns `-ENOSYS` as we saw above, and other elements will point to the `sys_syscall_name` functions.
At this point, we have filled the system call table and the Linux kernel knows where each system call handler is. But the Linux kernel does not call a `sys_syscall_name` function immediately after it is instructed to handle a system call from a user space application. Remember the [chapter](https://0xax.gitbooks.io/linux-insides/content/Interrupts/index.html) about interrupts and interrupt handling. When the Linux kernel gets the control to handle an interrupt, it had to do some preparations like save user space registers, switch to a new stack and many more tasks before it will call an interrupt handler. There is the same situation with the system call handling. The preparation for handling a system call is the first thing, but before the Linux kernel will start these preparations, the entry point of a system call must be initialized and only the Linux kernel knows how to perform this preparation. In the next paragraph we will see the process of the initialization of the system call entry in the Linux kernel. At this point, we have filled the system call table and the Linux kernel knows where each system call handler is. But the Linux kernel does not call a `sys_syscall_name` function immediately after it is instructed to handle a system call from a user space application. Remember the [chapter](https://0xax.gitbook.io/linux-insides/summary/interrupts) about interrupts and interrupt handling. When the Linux kernel gets the control to handle an interrupt, it had to do some preparations like save user space registers, switch to a new stack and many more tasks before it will call an interrupt handler. There is the same situation with the system call handling. The preparation for handling a system call is the first thing, but before the Linux kernel will start these preparations, the entry point of a system call must be initialized and only the Linux kernel knows how to perform this preparation. In the next paragraph we will see the process of the initialization of the system call entry in the Linux kernel.
Initialization of the system call entry Initialization of the system call entry
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
@ -126,7 +126,7 @@ SYSCALL invokes an OS system-call handler at privilege level 0.
It does so by loading RIP from the IA32_LSTAR MSR It does so by loading RIP from the IA32_LSTAR MSR
``` ```
it means that we need to put the system call entry in to the `IA32_LSTAR` [model specific register](https://en.wikipedia.org/wiki/Model-specific_register). This operation takes place during the Linux kernel initialization process. If you have read the fourth [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-4.html) of the chapter that describes interrupts and interrupt handling in the Linux kernel, you know that the Linux kernel calls the `trap_init` function during the initialization process. This function is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c) source code file and executes the initialization of the `non-early` exception handlers like divide error, [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) error etc. Besides the initialization of the `non-early` exceptions handlers, this function calls the `cpu_init` function from the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/blob/arch/x86/kernel/cpu/common.c) source code file which besides initialization of `per-cpu` state, calls the `syscall_init` function from the same source code file. it means that we need to put the system call entry in to the `IA32_LSTAR` [model specific register](https://en.wikipedia.org/wiki/Model-specific_register). This operation takes place during the Linux kernel initialization process. If you have read the fourth [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-4) of the chapter that describes interrupts and interrupt handling in the Linux kernel, you know that the Linux kernel calls the `trap_init` function during the initialization process. This function is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c) source code file and executes the initialization of the `non-early` exception handlers like divide error, [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) error etc. Besides the initialization of the `non-early` exceptions handlers, this function calls the `cpu_init` function from the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/blob/arch/x86/kernel/cpu/common.c) source code file which besides initialization of `per-cpu` state, calls the `syscall_init` function from the same source code file.
This function performs the initialization of the system call entry point. Let's look on the implementation of this function. It does not take parameters and first of all it fills two model specific registers: This function performs the initialization of the system call entry point. Let's look on the implementation of this function. It does not take parameters and first of all it fills two model specific registers:
@ -181,7 +181,7 @@ wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL); wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
``` ```
You can read more about the `Global Descriptor Table` in the second [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) of the chapter that describes the booting process of the Linux kernel. You can read more about the `Global Descriptor Table` in the second [part](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-2) of the chapter that describes the booting process of the Linux kernel.
At the end of the `syscall_init` function, we just mask flags in the [flags register](https://en.wikipedia.org/wiki/FLAGS_register) by writing the set of flags to the `MSR_SYSCALL_MASK` model specific register: At the end of the `syscall_init` function, we just mask flags in the [flags register](https://en.wikipedia.org/wiki/FLAGS_register) by writing the set of flags to the `MSR_SYSCALL_MASK` model specific register:
@ -210,7 +210,7 @@ This macro is defined in the [arch/x86/include/asm/irqflags.h](https://github.co
#define SWAPGS_UNSAFE_STACK swapgs #define SWAPGS_UNSAFE_STACK swapgs
``` ```
which exchanges the current GS base register value with the value contained in the `MSR_KERNEL_GS_BASE ` model specific register. In other words we moved it on to the kernel stack. After this we point the old stack pointer to the `rsp_scratch` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variable and setup the stack pointer to point to the top of stack for the current processor: which exchanges the current GS base register value with the value contained in the `MSR_KERNEL_GS_BASE ` model specific register. In other words we moved it on to the kernel stack. After this we point the old stack pointer to the `rsp_scratch` [per-cpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1) variable and setup the stack pointer to point to the top of stack for the current processor:
```assembly ```assembly
movq %rsp, PER_CPU_VAR(rsp_scratch) movq %rsp, PER_CPU_VAR(rsp_scratch)
@ -378,7 +378,7 @@ That's all.
Conclusion Conclusion
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the end of the second part about the system calls concept in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-1.html) we saw theory about this concept from the user application view. In this part we continued to dive into the stuff which is related to the system call concept and saw what the Linux kernel does when a system call occurs. This is the end of the second part about the system calls concept in the Linux kernel. In the previous [part](https://0xax.gitbook.io/linux-insides/summary/syscall/linux-syscall-1) we saw theory about this concept from the user application view. In this part we continued to dive into the stuff which is related to the system call concept and saw what the Linux kernel does when a system call occurs.
If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new). If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
@ -402,8 +402,8 @@ Links
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter) * [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
* [flags register](https://en.wikipedia.org/wiki/FLAGS_register) * [flags register](https://en.wikipedia.org/wiki/FLAGS_register)
* [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) * [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table)
* [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) * [per-cpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1)
* [general purpose registers](https://en.wikipedia.org/wiki/Processor_register) * [general purpose registers](https://en.wikipedia.org/wiki/Processor_register)
* [ABI](https://en.wikipedia.org/wiki/Application_binary_interface) * [ABI](https://en.wikipedia.org/wiki/Application_binary_interface)
* [x86_64 C ABI](http://www.x86-64.org/documentation/abi.pdf) * [x86_64 C ABI](http://www.x86-64.org/documentation/abi.pdf)
* [previous chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-1.html) * [previous chapter](https://0xax.gitbook.io/linux-insides/summary/syscall/linux-syscall-1)

View File

@ -4,7 +4,7 @@ System calls in the Linux kernel. Part 3.
vsyscalls and vDSO vsyscalls and vDSO
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) that describes system calls in the Linux kernel and we saw preparations after a system call caused by a userspace application and process of handling of a system call in the previous [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-2.html). In this part we will look at two concepts that are very close to the system call concept, they are called `vsyscall` and `vdso`. This is the third part of the [chapter](https://0xax.gitbook.io/linux-insides/summary/syscall) that describes system calls in the Linux kernel and we saw preparations after a system call caused by a userspace application and process of handling of a system call in the previous [part](https://0xax.gitbook.io/linux-insides/summary/syscall/linux-syscall-2). In this part we will look at two concepts that are very close to the system call concept, they are called `vsyscall` and `vdso`.
We already know what `system call`s are. They are special routines in the Linux kernel which userspace applications ask to do privileged tasks, like to read or to write to a file, to open a socket, etc. As you may know, invoking a system call is an expensive operation in the Linux kernel, because the processor must interrupt the currently executing task and switch context to kernel mode, subsequently jumping again into userspace after the system call handler finishes its work. These two mechanisms - `vsyscall` and `vdso` are designed to speed up this process for certain system calls and in this part we will try to understand how these mechanisms work. We already know what `system call`s are. They are special routines in the Linux kernel which userspace applications ask to do privileged tasks, like to read or to write to a file, to open a socket, etc. As you may know, invoking a system call is an expensive operation in the Linux kernel, because the processor must interrupt the currently executing task and switch context to kernel mode, subsequently jumping again into userspace after the system call handler finishes its work. These two mechanisms - `vsyscall` and `vdso` are designed to speed up this process for certain system calls and in this part we will try to understand how these mechanisms work.
@ -24,7 +24,7 @@ or:
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
``` ```
After this, these system calls will be executed in userspace and this means that there will not be [context switching](https://en.wikipedia.org/wiki/Context_switch). Mapping of the `vsyscall` page occurs in the `map_vsyscall` function that is defined in the [arch/x86/entry/vsyscall/vsyscall_64.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/vsyscall/vsyscall_64.c) source code file. This function is called during the Linux kernel initialization in the `setup_arch` function that is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c) source code file (we saw this function in the fifth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) of the Linux kernel initialization process chapter). After this, these system calls will be executed in userspace and this means that there will not be [context switching](https://en.wikipedia.org/wiki/Context_switch). Mapping of the `vsyscall` page occurs in the `map_vsyscall` function that is defined in the [arch/x86/entry/vsyscall/vsyscall_64.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/vsyscall/vsyscall_64.c) source code file. This function is called during the Linux kernel initialization in the `setup_arch` function that is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c) source code file (we saw this function in the fifth [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-5) of the Linux kernel initialization process chapter).
Note that implementation of the `map_vsyscall` function depends on the `CONFIG_X86_VSYSCALL_EMULATION` kernel configuration option: Note that implementation of the `map_vsyscall` function depends on the `CONFIG_X86_VSYSCALL_EMULATION` kernel configuration option:
@ -49,7 +49,7 @@ void __init map_vsyscall(void)
} }
``` ```
As we can see, at the beginning of the `map_vsyscall` function we get the physical address of the `vsyscall` page with the `__pa_symbol` macro (we already saw implementation if this macro in the fourth [path](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process). The `__vsyscall_page` symbol defined in the [arch/x86/entry/vsyscall/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/vsyscall/vsyscall_emu_64.S) assembly source code file and have the following [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space): As we can see, at the beginning of the `map_vsyscall` function we get the physical address of the `vsyscall` page with the `__pa_symbol` macro (we already saw implementation if this macro in the fourth [path](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-4) of the Linux kernel initialization process). The `__vsyscall_page` symbol defined in the [arch/x86/entry/vsyscall/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/vsyscall/vsyscall_emu_64.S) assembly source code file and have the following [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space):
``` ```
ffffffff81881000 D __vsyscall_page ffffffff81881000 D __vsyscall_page
@ -80,7 +80,7 @@ __vsyscall_page:
ret ret
``` ```
Let's go back to the implementation of the `map_vsyscall` function and return to the implementation of the `__vsyscall_page` later. After we received the physical address of the `__vsyscall_page`, we check the value of the `vsyscall_mode` variable and set the [fix-mapped](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-2.html) address for the `vsyscall` page with the `__set_fixmap` macro: Let's go back to the implementation of the `map_vsyscall` function and return to the implementation of the `__vsyscall_page` later. After we received the physical address of the `__vsyscall_page`, we check the value of the `vsyscall_mode` variable and set the [fix-mapped](https://0xax.gitbook.io/linux-insides/summary/mm/linux-mm-2) address for the `vsyscall` page with the `__set_fixmap` macro:
```C ```C
if (vsyscall_mode != NONE) if (vsyscall_mode != NONE)
@ -140,9 +140,9 @@ That will be called during early kernel parameters parsing:
early_param("vsyscall", vsyscall_setup); early_param("vsyscall", vsyscall_setup);
``` ```
More about `early_param` macro you can read in the sixth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the chapter that describes process of the initialization of the Linux kernel. More about `early_param` macro you can read in the sixth [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-6) of the chapter that describes process of the initialization of the Linux kernel.
In the end of the `vsyscall_map` function we just check that virtual address of the `vsyscall` page is equal to the value of the `VSYSCALL_ADDR` with the [BUILD_BUG_ON](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) macro: In the end of the `vsyscall_map` function we just check that virtual address of the `vsyscall` page is equal to the value of the `VSYSCALL_ADDR` with the [BUILD_BUG_ON](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-1) macro:
```C ```C
BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) != BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
@ -252,7 +252,7 @@ Here we can see that [uname](https://en.wikipedia.org/wiki/Uname) util was linke
* `libc.so.6`; * `libc.so.6`;
* `ld-linux-x86-64.so.2`. * `ld-linux-x86-64.so.2`.
The first provides `vDSO` functionality, the second is `C` [standard library](https://en.wikipedia.org/wiki/C_standard_library) and the third is the program interpreter (more about this you can read in the part that describes [linkers](https://0xax.gitbooks.io/linux-insides/content/Misc/linux-misc-3.html)). So, the `vDSO` solves limitations of the `vsyscall`. Implementation of the `vDSO` is similar to `vsyscall`. The first provides `vDSO` functionality, the second is `C` [standard library](https://en.wikipedia.org/wiki/C_standard_library) and the third is the program interpreter (more about this you can read in the part that describes [linkers](https://0xax.gitbook.io/linux-insides/summary/misc/linux-misc-3)). So, the `vDSO` solves limitations of the `vsyscall`. Implementation of the `vDSO` is similar to `vsyscall`.
Initialization of the `vDSO` occurs in the `init_vdso` function that defined in the [arch/x86/entry/vdso/vma.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/vdso/vma.c) source code file. This function starts from the initialization of the `vDSO` images for 32-bits and 64-bits depends on the `CONFIG_X86_X32_ABI` kernel configuration option: Initialization of the `vDSO` occurs in the `init_vdso` function that defined in the [arch/x86/entry/vdso/vma.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/vdso/vma.c) source code file. This function starts from the initialization of the `vDSO` images for 32-bits and 64-bits depends on the `CONFIG_X86_X32_ABI` kernel configuration option:
@ -370,11 +370,11 @@ That's all.
Conclusion Conclusion
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the end of the third part about the system calls concept in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-2.html) we discussed the implementation of the preparation from the Linux kernel side, before a system call will be handled and implementation of the `exit` process from a system call handler. In this part we continued to dive into the stuff which is related to the system call concept and learned two new concepts that are very similar to the system call - the `vsyscall` and the `vDSO`. This is the end of the third part about the system calls concept in the Linux kernel. In the previous [part](https://0xax.gitbook.io/linux-insides/summary/syscall/linux-syscall-2) we discussed the implementation of the preparation from the Linux kernel side, before a system call will be handled and implementation of the `exit` process from a system call handler. In this part we continued to dive into the stuff which is related to the system call concept and learned two new concepts that are very similar to the system call - the `vsyscall` and the `vDSO`.
After all of these three parts, we know almost all things that are related to system calls, we know what system call is and why user applications need them. We also know what occurs when a user application calls a system call and how the kernel handles system calls. After all of these three parts, we know almost all things that are related to system calls, we know what system call is and why user applications need them. We also know what occurs when a user application calls a system call and how the kernel handles system calls.
The next part will be the last part in this [chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) and we will see what occurs when a user runs the program. The next part will be the last part in this [chapter](https://0xax.gitbook.io/linux-insides/summary/syscall) and we will see what occurs when a user runs the program.
If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new). If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
@ -390,14 +390,14 @@ Links
* [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space) * [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space)
* [Segmentation](https://en.wikipedia.org/wiki/Memory_segmentation) * [Segmentation](https://en.wikipedia.org/wiki/Memory_segmentation)
* [enum](https://en.wikipedia.org/wiki/Enumerated_type) * [enum](https://en.wikipedia.org/wiki/Enumerated_type)
* [fix-mapped addresses](https://0xax.gitbooks.io/linux-insides/content/MM/linux-mm-2.html) * [fix-mapped addresses](https://0xax.gitbook.io/linux-insides/summary/mm/linux-mm-2)
* [glibc](https://en.wikipedia.org/wiki/GNU_C_Library) * [glibc](https://en.wikipedia.org/wiki/GNU_C_Library)
* [BUILD_BUG_ON](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) * [BUILD_BUG_ON](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-1)
* [Processor register](https://en.wikipedia.org/wiki/Processor_register) * [Processor register](https://en.wikipedia.org/wiki/Processor_register)
* [Page fault](https://en.wikipedia.org/wiki/Page_fault) * [Page fault](https://en.wikipedia.org/wiki/Page_fault)
* [segmentation fault](https://en.wikipedia.org/wiki/Segmentation_fault) * [segmentation fault](https://en.wikipedia.org/wiki/Segmentation_fault)
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter) * [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
* [stack pointer](https://en.wikipedia.org/wiki/Stack_register) * [stack pointer](https://en.wikipedia.org/wiki/Stack_register)
* [uname](https://en.wikipedia.org/wiki/Uname) * [uname](https://en.wikipedia.org/wiki/Uname)
* [Linkers](https://0xax.gitbooks.io/linux-insides/content/Misc/linux-misc-3.html) * [Linkers](https://0xax.gitbook.io/linux-insides/summary/misc/linux-misc-3)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-2.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/syscall/linux-syscall-2)

View File

@ -4,7 +4,7 @@ System calls in the Linux kernel. Part 4.
How does the Linux kernel run a program How does the Linux kernel run a program
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the fourth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) that describes [system calls](https://en.wikipedia.org/wiki/System_call) in the Linux kernel and as I wrote in the conclusion of the [previous](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-3.html) - this part will be last in this chapter. In the previous part we stopped at the two new concepts: This is the fourth part of the [chapter](https://0xax.gitbook.io/linux-insides/summary/syscall) that describes [system calls](https://en.wikipedia.org/wiki/System_call) in the Linux kernel and as I wrote in the conclusion of the [previous](https://0xax.gitbook.io/linux-insides/summary/syscall/linux-syscall-3) - this part will be last in this chapter. In the previous part we stopped at the two new concepts:
* `vsyscall`; * `vsyscall`;
* `vDSO`; * `vDSO`;
@ -73,7 +73,7 @@ So, a user application (`bash` in our case) calls the system call and as we alre
execve system call execve system call
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
We saw preparation before a system call called by a user application and after a system call handler finished its work in the second [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-2.html) of this chapter. We stopped at the call of the `execve` system call in the previous paragraph. This system call defined in the [fs/exec.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/fs/exec.c) source code file and as we already know it takes three arguments: We saw preparation before a system call called by a user application and after a system call handler finished its work in the second [part](https://0xax.gitbook.io/linux-insides/summary/syscall/linux-syscall-2) of this chapter. We stopped at the call of the `execve` system call in the previous paragraph. This system call defined in the [fs/exec.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/fs/exec.c) source code file and as we already know it takes three arguments:
``` ```
SYSCALL_DEFINE3(execve, SYSCALL_DEFINE3(execve,
@ -334,7 +334,7 @@ if (!elf_phdata)
goto out; goto out;
``` ```
that describes [segments](https://en.wikipedia.org/wiki/Memory_segmentation). Read the `program interpreter` and libraries that linked with the our executable binary file from disk and load it to memory. The `program interpreter` specified in the `.interp` section of the executable file and as you can read in the part that describes [Linkers](https://0xax.gitbooks.io/linux-insides/content/Misc/linux-misc-3.html) it is - `/lib64/ld-linux-x86-64.so.2` for the `x86_64`. It setups the stack and map `elf` binary into the correct location in memory. It maps the [bss](https://en.wikipedia.org/wiki/.bss) and the [brk](http://man7.org/linux/man-pages/man2/sbrk.2.html) sections and does many many other different things to prepare executable file to execute. that describes [segments](https://en.wikipedia.org/wiki/Memory_segmentation). Read the `program interpreter` and libraries that linked with the our executable binary file from disk and load it to memory. The `program interpreter` specified in the `.interp` section of the executable file and as you can read in the part that describes [Linkers](https://0xax.gitbook.io/linux-insides/summary/misc/linux-misc-3) it is - `/lib64/ld-linux-x86-64.so.2` for the `x86_64`. It setups the stack and map `elf` binary into the correct location in memory. It maps the [bss](https://en.wikipedia.org/wiki/.bss) and the [brk](http://man7.org/linux/man-pages/man2/sbrk.2.html) sections and does many many other different things to prepare executable file to execute.
In the end of the execution of the `load_elf_binary` we call the `start_thread` function and pass three arguments to it: In the end of the execution of the `load_elf_binary` we call the `start_thread` function and pass three arguments to it:
@ -424,7 +424,7 @@ Links
* [Alpha](https://en.wikipedia.org/wiki/DEC_Alpha) * [Alpha](https://en.wikipedia.org/wiki/DEC_Alpha)
* [FDPIC](http://elinux.org/UClinux_Shared_Library#FDPIC_ELF) * [FDPIC](http://elinux.org/UClinux_Shared_Library#FDPIC_ELF)
* [segments](https://en.wikipedia.org/wiki/Memory_segmentation) * [segments](https://en.wikipedia.org/wiki/Memory_segmentation)
* [Linkers](https://0xax.gitbooks.io/linux-insides/content/Misc/linux-misc-3.html) * [Linkers](https://0xax.gitbook.io/linux-insides/summary/misc/linux-misc-3)
* [Processor register](https://en.wikipedia.org/wiki/Processor_register) * [Processor register](https://en.wikipedia.org/wiki/Processor_register)
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter) * [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-3.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/syscall/linux-syscall-3)

View File

@ -49,7 +49,7 @@ So let's start.
Definition of the open system call Definition of the open system call
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
If you have read the [fourth part](https://github.com/0xAX/linux-insides/blob/master/SysCall/linux-syscall-4.md) of the [linux-insides](https://0xax.gitbooks.io/linux-insides/content/index.html) book, you should know that system calls are defined with the help of `SYSCALL_DEFINE` macro. So, the `open` system call is not exception. If you have read the [fourth part](https://github.com/0xAX/linux-insides/blob/master/SysCall/linux-syscall-4.md) of the [linux-insides](https://github.com/0xAX/linux-insides/blob/master/SUMMARY.md) book, you should know that system calls are defined with the help of `SYSCALL_DEFINE` macro. So, the `open` system call is not exception.
Definition of the `open` system call is located in the [fs/open.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/fs/open.c) source code file and looks pretty small for the first view: Definition of the `open` system call is located in the [fs/open.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/fs/open.c) source code file and looks pretty small for the first view:
@ -400,4 +400,4 @@ Links
* [inode](https://en.wikipedia.org/wiki/Inode) * [inode](https://en.wikipedia.org/wiki/Inode)
* [RCU](https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt) * [RCU](https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt)
* [read](http://man7.org/linux/man-pages/man2/read.2.html) * [read](http://man7.org/linux/man-pages/man2/read.2.html)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-4.html) * [previous part](https://0xax.gitbook.io/linux-insides/summary/syscall/linux-syscall-4)

View File

@ -4,7 +4,7 @@ Paging
Introduction Introduction
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
In the fifth [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) of the series `Linux kernel booting process` we learned about what the kernel does in its earliest stage. In the next step the kernel will initialize different things like `initrd` mounting, lockdep initialization, and many many other things, before we can see how the kernel runs the first init process. In the fifth [part](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-5) of the series `Linux kernel booting process` we learned about what the kernel does in its earliest stage. In the next step the kernel will initialize different things like `initrd` mounting, lockdep initialization, and many many other things, before we can see how the kernel runs the first init process.
Yeah, there will be many different things, but many many and once again many work with **memory**. Yeah, there will be many different things, but many many and once again many work with **memory**.
@ -259,4 +259,4 @@ Links
* [MMU](http://en.wikipedia.org/wiki/Memory_management_unit) * [MMU](http://en.wikipedia.org/wiki/Memory_management_unit)
* [ELF64](https://github.com/0xAX/linux-insides/blob/master/Theory/ELF.md) * [ELF64](https://github.com/0xAX/linux-insides/blob/master/Theory/ELF.md)
* [Documentation/x86/x86_64/mm.txt](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/x86_64/mm.txt) * [Documentation/x86/x86_64/mm.txt](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/x86_64/mm.txt)
* [Last part - Kernel booting process](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) * [Last part - Kernel booting process](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-5)

View File

@ -40,7 +40,7 @@ The `asm` keyword may be used in place of `__asm__`, however `__asm__` is portab
If you know assembly programming language this looks pretty familiar. The main problem is in the second form of inline assembly statements - `extended`. This form allows us to pass parameters to an assembly statement, perform [jumps](https://en.wikipedia.org/wiki/Branch_%28computer_science%29) etc. Does not sound difficult, but requires knowledge of special rules in addition to knowledge of the assembly language. Every time I see yet another piece of inline assembly code in the Linux kernel, I need to refer to the official [documentation](https://gcc.gnu.org/onlinedocs/) of `GCC` to remember how a particular `qualifier` behaves or what the meaning of `=&r` is for example. If you know assembly programming language this looks pretty familiar. The main problem is in the second form of inline assembly statements - `extended`. This form allows us to pass parameters to an assembly statement, perform [jumps](https://en.wikipedia.org/wiki/Branch_%28computer_science%29) etc. Does not sound difficult, but requires knowledge of special rules in addition to knowledge of the assembly language. Every time I see yet another piece of inline assembly code in the Linux kernel, I need to refer to the official [documentation](https://gcc.gnu.org/onlinedocs/) of `GCC` to remember how a particular `qualifier` behaves or what the meaning of `=&r` is for example.
I've decided to write this part to consolidate my knowledge related to the inline assembly, as inline assembly statements are quite common in the Linux kernel and we may see them in [linux-insides](https://0xax.gitbooks.io/linux-insides/content/) parts sometimes. I thought that it would be useful if we have a special part which contains information on more important aspects of the inline assembly. Of course you may find comprehensive information about inline assembly in the official [documentation](https://gcc.gnu.org/onlinedocs/gcc/Using-Assembly-Language-with-C.html#Using-Assembly-Language-with-C), but I like to put everything in one place. I've decided to write this part to consolidate my knowledge related to the inline assembly, as inline assembly statements are quite common in the Linux kernel and we may see them in [linux-insides](https://github.com/0xAX/linux-insides/blob/master/SUMMARY.md) parts sometimes. I thought that it would be useful if we have a special part which contains information on more important aspects of the inline assembly. Of course you may find comprehensive information about inline assembly in the official [documentation](https://gcc.gnu.org/onlinedocs/gcc/Using-Assembly-Language-with-C.html#Using-Assembly-Language-with-C), but I like to put everything in one place.
** Note: This part will not provide guide for assembly programming. It is not intended to teach you to write programs with assembler or to know what one or another assembler instruction means. Just a little memo for extended asm. ** ** Note: This part will not provide guide for assembly programming. It is not intended to teach you to write programs with assembler or to know what one or another assembler instruction means. Just a little memo for extended asm. **

View File

@ -4,16 +4,16 @@ Timers and time management in the Linux kernel. Part 1.
Introduction Introduction
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is yet another post that opens a new chapter in the [linux-insides](https://0xax.gitbooks.io/linux-insides/content/) book. The previous [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-4.html) described [system call](https://en.wikipedia.org/wiki/System_call) concepts, and now it's time to start new chapter. As one might understand from the title, this chapter will be devoted to the `timers` and `time management` in the Linux kernel. The choice of topic for the current chapter is not accidental. Timers (and generally, time management) are very important and widely used in the Linux kernel. The Linux kernel uses timers for various tasks, for example different timeouts in the [TCP](https://en.wikipedia.org/wiki/Transmission_Control_Protocol) implementation, the kernel knowing current time, scheduling asynchronous functions, next event interrupt scheduling and many many more. This is yet another post that opens a new chapter in the [linux-insides](https://github.com/0xAX/linux-insides/blob/master/SUMMARY.md) book. The previous [part](https://0xax.gitbook.io/linux-insides/summary/syscall/linux-syscall-4) described [system call](https://en.wikipedia.org/wiki/System_call) concepts, and now it's time to start new chapter. As one might understand from the title, this chapter will be devoted to the `timers` and `time management` in the Linux kernel. The choice of topic for the current chapter is not accidental. Timers (and generally, time management) are very important and widely used in the Linux kernel. The Linux kernel uses timers for various tasks, for example different timeouts in the [TCP](https://en.wikipedia.org/wiki/Transmission_Control_Protocol) implementation, the kernel knowing current time, scheduling asynchronous functions, next event interrupt scheduling and many many more.
So, we will start to learn implementation of the different time management related stuff in this part. We will see different types of timers and how different Linux kernel subsystems use them. As always, we will start from the earliest part of the Linux kernel and go through the initialization process of the Linux kernel. We already did it in the special [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) which describes the initialization process of the Linux kernel, but as you may remember we missed some things there. And one of them is the initialization of timers. So, we will start to learn implementation of the different time management related stuff in this part. We will see different types of timers and how different Linux kernel subsystems use them. As always, we will start from the earliest part of the Linux kernel and go through the initialization process of the Linux kernel. We already did it in the special [chapter](https://0xax.gitbook.io/linux-insides/summary/initialization) which describes the initialization process of the Linux kernel, but as you may remember we missed some things there. And one of them is the initialization of timers.
Let's start. Let's start.
Initialization of non-standard PC hardware clock Initialization of non-standard PC hardware clock
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
After the Linux kernel was decompressed (more about this you can read in the [Kernel decompression](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) part) the architecture non-specific code starts to work in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) source code file. After initialization of the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt), initialization of [cgroups](https://en.wikipedia.org/wiki/Cgroups) and setting [canary](https://en.wikipedia.org/wiki/Buffer_overflow_protection) value we can see the call of the `setup_arch` function. After the Linux kernel was decompressed (more about this you can read in the [Kernel decompression](https://0xax.gitbook.io/linux-insides/summary/booting/linux-bootstrap-5) part) the architecture non-specific code starts to work in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) source code file. After initialization of the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt), initialization of [cgroups](https://en.wikipedia.org/wiki/Cgroups) and setting [canary](https://en.wikipedia.org/wiki/Buffer_overflow_protection) value we can see the call of the `setup_arch` function.
As you may remember, this function (defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c#L842)) prepares/initializes architecture-specific stuff (for example it reserves a place for [bss](https://en.wikipedia.org/wiki/.bss) section, reserves a place for [initrd](https://en.wikipedia.org/wiki/Initrd), parses kernel command line, and many, many other things). Besides this, we can find some time management related functions there. As you may remember, this function (defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c#L842)) prepares/initializes architecture-specific stuff (for example it reserves a place for [bss](https://en.wikipedia.org/wiki/.bss) section, reserves a place for [initrd](https://en.wikipedia.org/wiki/Initrd), parses kernel command line, and many, many other things). Besides this, we can find some time management related functions there.
@ -433,4 +433,4 @@ Links
* [Intel 8253](https://en.wikipedia.org/wiki/Intel_8253) * [Intel 8253](https://en.wikipedia.org/wiki/Intel_8253)
* [seqlocks](https://en.wikipedia.org/wiki/Seqlock) * [seqlocks](https://en.wikipedia.org/wiki/Seqlock)
* [cloksource documentation](https://www.kernel.org/doc/Documentation/timers/timekeeping.txt) * [cloksource documentation](https://www.kernel.org/doc/Documentation/timers/timekeeping.txt)
* [Previous chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) * [Previous chapter](https://0xax.gitbook.io/linux-insides/summary/syscall)

View File

@ -4,7 +4,7 @@ Timers and time management in the Linux kernel. Part 2.
Introduction to the `clocksource` framework Introduction to the `clocksource` framework
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
The previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html) was the first part in the current [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and time management related stuff in the Linux kernel. We got acquainted with two concepts in the previous part: The previous [part](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-1) was the first part in the current [chapter](https://0xax.gitbook.io/linux-insides/summary/timers/) that describes timers and time management related stuff in the Linux kernel. We got acquainted with two concepts in the previous part:
* `jiffies` * `jiffies`
* `clocksource` * `clocksource`
@ -92,7 +92,7 @@ Within this framework, each clock source is required to maintain a representatio
The `clocksource` structure The `clocksource` structure
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
The fundamental of the `clocksource` framework is the `clocksource` structure that defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/clocksource.h) header file. We already saw some fields that are provided by the `clocksource` structure in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html). Let's look on the full definition of this structure and try to describe all of its fields: The fundamental of the `clocksource` framework is the `clocksource` structure that defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/clocksource.h) header file. We already saw some fields that are provided by the `clocksource` structure in the previous [part](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-1). Let's look on the full definition of this structure and try to describe all of its fields:
```C ```C
struct clocksource { struct clocksource {
@ -197,7 +197,7 @@ That's all. From this moment we know all fields of the `clocksource` structure.
New `clocksource` registration New `clocksource` registration
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
We saw only one function from the `clocksource` framework in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html). This function was - `__clocksource_register`. This function defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/tree/master/include/linux/clocksource.h) header file and as we can understand from the function's name, main point of this function is to register new clocksource. If we will look on the implementation of the `__clocksource_register` function, we will see that it just makes call of the `__clocksource_register_scale` function and returns its result: We saw only one function from the `clocksource` framework in the previous [part](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-1). This function was - `__clocksource_register`. This function defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/tree/master/include/linux/clocksource.h) header file and as we can understand from the function's name, main point of this function is to register new clocksource. If we will look on the implementation of the `__clocksource_register` function, we will see that it just makes call of the `__clocksource_register_scale` function and returns its result:
```C ```C
static inline int __clocksource_register(struct clocksource *cs) static inline int __clocksource_register(struct clocksource *cs)
@ -241,7 +241,7 @@ int __clocksource_register_scale(struct clocksource *cs, u32 scale, u32 freq)
} }
``` ```
First of all we can see that the `__clocksource_register_scale` function starts from the call of the `__clocksource_update_freq_scale` function that defined in the same source code file and updates given clock source with the new frequency. Let's look on the implementation of this function. In the first step we need to check given frequency and if it was not passed as `zero`, we need to calculate `mult` and `shift` parameters for the given clock source. Why do we need to check value of the `frequency`? Actually it can be zero. if you attentively looked on the implementation of the `__clocksource_register` function, you may have noticed that we passed `frequency` as `0`. We will do it only for some clock sources that have self defined `mult` and `shift` parameters. Look in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html) and you will see that we saw calculation of the `mult` and `shift` for `jiffies`. The `__clocksource_update_freq_scale` function will do it for us for other clock sources. First of all we can see that the `__clocksource_register_scale` function starts from the call of the `__clocksource_update_freq_scale` function that defined in the same source code file and updates given clock source with the new frequency. Let's look on the implementation of this function. In the first step we need to check given frequency and if it was not passed as `zero`, we need to calculate `mult` and `shift` parameters for the given clock source. Why do we need to check value of the `frequency`? Actually it can be zero. if you attentively looked on the implementation of the `__clocksource_register` function, you may have noticed that we passed `frequency` as `0`. We will do it only for some clock sources that have self defined `mult` and `shift` parameters. Look in the previous [part](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-1) and you will see that we saw calculation of the `mult` and `shift` for `jiffies`. The `__clocksource_update_freq_scale` function will do it for us for other clock sources.
So in the start of the `__clocksource_update_freq_scale` function we check the value of the `frequency` parameter and if is not zero we need to calculate `mult` and `shift` for the given clock source. Let's look on the `mult` and `shift` calculation: So in the start of the `__clocksource_update_freq_scale` function we check the value of the `frequency` parameter and if is not zero we need to calculate `mult` and `shift` for the given clock source. Let's look on the `mult` and `shift` calculation:
@ -448,4 +448,4 @@ Links
* [clock rate](https://en.wikipedia.org/wiki/Clock_rate) * [clock rate](https://en.wikipedia.org/wiki/Clock_rate)
* [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) * [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
* [sysfs](https://en.wikipedia.org/wiki/Sysfs) * [sysfs](https://en.wikipedia.org/wiki/Sysfs)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html) * [previous part](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-1)

View File

@ -4,7 +4,7 @@ Timers and time management in the Linux kernel. Part 3.
The tick broadcast framework and dyntick The tick broadcast framework and dyntick
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel and we stopped on the `clocksource` framework in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-2.html). We have started to consider this framework because it is closely related to the special counters which are provided by the Linux kernel. One of these counters which we already saw in the first [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html) of this chapter is - `jiffies`. As I already wrote in the first part of this chapter, we will consider time management related stuff step by step during the Linux kernel initialization. Previous step was call of the: This is third part of the [chapter](https://0xax.gitbook.io/linux-insides/summary/timers/) which describes timers and time management related stuff in the Linux kernel and we stopped on the `clocksource` framework in the previous [part](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-2). We have started to consider this framework because it is closely related to the special counters which are provided by the Linux kernel. One of these counters which we already saw in the first [part](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-1) of this chapter is - `jiffies`. As I already wrote in the first part of this chapter, we will consider time management related stuff step by step during the Linux kernel initialization. Previous step was call of the:
```C ```C
register_refined_jiffies(CLOCK_TICK_RATE); register_refined_jiffies(CLOCK_TICK_RATE);
@ -12,7 +12,7 @@ register_refined_jiffies(CLOCK_TICK_RATE);
function which defined in the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/time/jiffies.c) source code file and executes initialization of the `refined_jiffies` clock source for us. Recall that this function is called from the `setup_arch` function that defined in the [https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c](arch/x86/kernel/setup.c) source code and executes architecture-specific ([x86_64](https://en.wikipedia.org/wiki/X86-64) in our case) initialization. Look on the implementation of the `setup_arch` and you will note that the call of the `register_refined_jiffies` is the last step before the `setup_arch` function will finish its work. function which defined in the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/time/jiffies.c) source code file and executes initialization of the `refined_jiffies` clock source for us. Recall that this function is called from the `setup_arch` function that defined in the [https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c](arch/x86/kernel/setup.c) source code and executes architecture-specific ([x86_64](https://en.wikipedia.org/wiki/X86-64) in our case) initialization. Look on the implementation of the `setup_arch` and you will note that the call of the `register_refined_jiffies` is the last step before the `setup_arch` function will finish its work.
There are many different `x86_64` specific things already configured after the end of the `setup_arch` execution. For example some early [interrupt](https://en.wikipedia.org/wiki/Interrupt) handlers already able to handle interrupts, memory space reserved for the [initrd](https://en.wikipedia.org/wiki/Initrd), [DMI](https://en.wikipedia.org/wiki/Desktop_Management_Interface) scanned, the Linux kernel log buffer is already set and this means that the [printk](https://en.wikipedia.org/wiki/Printk) function is able to work, [e820](https://en.wikipedia.org/wiki/E820) parsed and the Linux kernel already knows about available memory and and many many other architecture specific things (if you are interesting, you can read more about the `setup_arch` function and Linux kernel initialization process in the second [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) of this book). There are many different `x86_64` specific things already configured after the end of the `setup_arch` execution. For example some early [interrupt](https://en.wikipedia.org/wiki/Interrupt) handlers already able to handle interrupts, memory space reserved for the [initrd](https://en.wikipedia.org/wiki/Initrd), [DMI](https://en.wikipedia.org/wiki/Desktop_Management_Interface) scanned, the Linux kernel log buffer is already set and this means that the [printk](https://en.wikipedia.org/wiki/Printk) function is able to work, [e820](https://en.wikipedia.org/wiki/E820) parsed and the Linux kernel already knows about available memory and and many many other architecture specific things (if you are interesting, you can read more about the `setup_arch` function and Linux kernel initialization process in the second [chapter](https://0xax.gitbook.io/linux-insides/summary/initialization) of this book).
Now, the `setup_arch` finished its work and we can back to the generic Linux kernel code. Recall that the `setup_arch` function was called from the `start_kernel` function which is defined in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) source code file. So, we shall return to this function. You can see that there are many different function are called right after `setup_arch` function inside of the `start_kernel` function, but since our chapter is devoted to timers and time management related stuff, we will skip all code which is not related to this topic. The first function which is related to the time management in the Linux kernel is: Now, the `setup_arch` finished its work and we can back to the generic Linux kernel code. Recall that the `setup_arch` function was called from the `start_kernel` function which is defined in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) source code file. So, we shall return to this function. You can see that there are many different function are called right after `setup_arch` function inside of the `start_kernel` function, but since our chapter is devoted to timers and time management related stuff, we will skip all code which is not related to this topic. The first function which is related to the time management in the Linux kernel is:
@ -42,7 +42,7 @@ void __init tick_init(void)
As you can understand from the paragraph's title, we are interesting only in the `tick_broadcast_init` function for now. This function defined in the [kernel/time/tick-broadcast.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/time/tick-broadcast.c) source code file and executes initialization of the `tick broadcast` framework related data structures. Before we will look on the implementation of the `tick_broadcast_init` function and will try to understand what does this function do, we need to know about `tick broadcast` framework. As you can understand from the paragraph's title, we are interesting only in the `tick_broadcast_init` function for now. This function defined in the [kernel/time/tick-broadcast.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/time/tick-broadcast.c) source code file and executes initialization of the `tick broadcast` framework related data structures. Before we will look on the implementation of the `tick_broadcast_init` function and will try to understand what does this function do, we need to know about `tick broadcast` framework.
Main point of a central processor is to execute programs. But sometimes a processor may be in a special state when it is not being used by any program. This special state is called - [idle](https://en.wikipedia.org/wiki/Idle_%28CPU%29). When the processor has no anything to execute, the Linux kernel launches `idle` task. We already saw a little about this in the last part of the [Linux kernel initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-10.html). When the Linux kernel will finish all initialization processes in the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) source code file, it will call the `rest_init` function from the same source code file. Main point of this function is to launch kernel `init` thread and the `kthreadd` thread, to call the `schedule` function to start task scheduling and to go to sleep by calling the `cpu_idle_loop` function that defined in the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/sched/idle.c) source code file. Main point of a central processor is to execute programs. But sometimes a processor may be in a special state when it is not being used by any program. This special state is called - [idle](https://en.wikipedia.org/wiki/Idle_%28CPU%29). When the processor has no anything to execute, the Linux kernel launches `idle` task. We already saw a little about this in the last part of the [Linux kernel initialization process](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-10). When the Linux kernel will finish all initialization processes in the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) source code file, it will call the `rest_init` function from the same source code file. Main point of this function is to launch kernel `init` thread and the `kthreadd` thread, to call the `schedule` function to start task scheduling and to go to sleep by calling the `cpu_idle_loop` function that defined in the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/sched/idle.c) source code file.
The `cpu_idle_loop` function represents infinite loop which checks the need for rescheduling on each iteration. After the scheduler finds something to execute, the `idle` process will finish its work and the control will be moved to a new runnable task with the call of the `schedule_preempt_disabled` function: The `cpu_idle_loop` function represents infinite loop which checks the need for rescheduling on each iteration. After the scheduler finds something to execute, the `idle` process will finish its work and the control will be moved to a new runnable task with the call of the `schedule_preempt_disabled` function:
@ -102,7 +102,7 @@ void __init tick_broadcast_init(void)
} }
``` ```
As we can see, the `tick_broadcast_init` function allocates different [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) with the help of the `zalloc_cpumask_var` function. The `zalloc_cpumask_var` function defined in the [lib/cpumask.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/lib/cpumask.c) source code file and expands to the call of the following function: As we can see, the `tick_broadcast_init` function allocates different [cpumasks](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2) with the help of the `zalloc_cpumask_var` function. The `zalloc_cpumask_var` function defined in the [lib/cpumask.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/lib/cpumask.c) source code file and expands to the call of the following function:
```C ```C
bool zalloc_cpumask_var(cpumask_var_t *mask, gfp_t flags) bool zalloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
@ -243,7 +243,7 @@ int tick_device_uses_broadcast(struct clock_event_device *dev, int cpu)
} }
``` ```
More about the `smp_processor_id` macro you can read in the fourth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process chapter. More about the `smp_processor_id` macro you can read in the fourth [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-4) of the Linux kernel initialization process chapter.
The `tick_broadcast_start_periodic` function check the given `clock event` device and call the `tick_setup_periodic` function: The `tick_broadcast_start_periodic` function check the given `clock event` device and call the `tick_setup_periodic` function:
@ -407,7 +407,7 @@ for_each_cpu(cpu, tick_nohz_full_mask)
context_tracking_cpu_set(cpu); context_tracking_cpu_set(cpu);
``` ```
The `context_tracking_cpu_set` function defined in the [kernel/context_tracking.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/context_tracking.c) source code file and main point of this function is to set the `context_tracking.active` [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variable to `true`. When the `active` field will be set to `true` for the certain processor, all [context switches](https://en.wikipedia.org/wiki/Context_switch) will be ignored by the Linux kernel context tracking subsystem for this processor. The `context_tracking_cpu_set` function defined in the [kernel/context_tracking.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/context_tracking.c) source code file and main point of this function is to set the `context_tracking.active` [percpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1) variable to `true`. When the `active` field will be set to `true` for the certain processor, all [context switches](https://en.wikipedia.org/wiki/Context_switch) will be ignored by the Linux kernel context tracking subsystem for this processor.
That's all. This is the end of the `tick_nohz_init` function. After this `NO_HZ` related data structures will be initialized. We didn't see API of the `NO_HZ` mode, but will see it soon. That's all. This is the end of the `tick_nohz_init` function. After this `NO_HZ` related data structures will be initialized. We didn't see API of the `NO_HZ` mode, but will see it soon.
@ -433,12 +433,12 @@ Links
* [CPU idle](https://en.wikipedia.org/wiki/Idle_%28CPU%29) * [CPU idle](https://en.wikipedia.org/wiki/Idle_%28CPU%29)
* [power management](https://en.wikipedia.org/wiki/Power_management) * [power management](https://en.wikipedia.org/wiki/Power_management)
* [NO_HZ documentation](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/timers/NO_HZ.txt) * [NO_HZ documentation](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/timers/NO_HZ.txt)
* [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) * [cpumasks](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2)
* [high precision event timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) * [high precision event timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
* [irq](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) * [irq](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [IPI](https://en.wikipedia.org/wiki/Inter-processor_interrupt) * [IPI](https://en.wikipedia.org/wiki/Inter-processor_interrupt)
* [CPUID](https://en.wikipedia.org/wiki/CPUID) * [CPUID](https://en.wikipedia.org/wiki/CPUID)
* [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) * [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
* [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) * [percpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1)
* [context switches](https://en.wikipedia.org/wiki/Context_switch) * [context switches](https://en.wikipedia.org/wiki/Context_switch)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-2.html) * [Previous part](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-2)

View File

@ -4,7 +4,7 @@ Timers and time management in the Linux kernel. Part 4.
Timers Timers
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is fourth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel and in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-3.html) we knew about the `tick broadcast` framework and `NO_HZ` mode in the Linux kernel. We will continue to dive into the time management related stuff in the Linux kernel in this part and will be acquainted with yet another concept in the Linux kernel - `timers`. Before we will look at timers in the Linux kernel, we have to learn some theory about this concept. Note that we will consider software timers in this part. This is fourth part of the [chapter](https://0xax.gitbook.io/linux-insides/summary/timers/) which describes timers and time management related stuff in the Linux kernel and in the previous [part](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-3) we knew about the `tick broadcast` framework and `NO_HZ` mode in the Linux kernel. We will continue to dive into the time management related stuff in the Linux kernel in this part and will be acquainted with yet another concept in the Linux kernel - `timers`. Before we will look at timers in the Linux kernel, we have to learn some theory about this concept. Note that we will consider software timers in this part.
The Linux kernel provides a `software timer` concept to allow to kernel functions could be invoked at future moment. Timers are widely used in the Linux kernel. For example, look in the [net/netfilter/ipset/ip_set_list_set.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/net/netfilter/ipset/ip_set_list_set.c) source code file. This source code file provides implementation of the framework for the managing of groups of [IP](https://en.wikipedia.org/wiki/Internet_Protocol) addresses. The Linux kernel provides a `software timer` concept to allow to kernel functions could be invoked at future moment. Timers are widely used in the Linux kernel. For example, look in the [net/netfilter/ipset/ip_set_list_set.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/net/netfilter/ipset/ip_set_list_set.c) source code file. This source code file provides implementation of the framework for the managing of groups of [IP](https://en.wikipedia.org/wiki/Internet_Protocol) addresses.
@ -45,7 +45,7 @@ Now let's continue to research source code of Linux kernel which is related to t
Introduction to dynamic timers in the Linux kernel Introduction to dynamic timers in the Linux kernel
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
As I already wrote, we knew about the `tick broadcast` framework and `NO_HZ` mode in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-3.html). They will be initialized in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) source code file by the call of the `tick_init` function. If we will look at this source code file, we will see that the next time management related function is: As I already wrote, we knew about the `tick broadcast` framework and `NO_HZ` mode in the previous [part](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-3). They will be initialized in the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) source code file by the call of the `tick_init` function. If we will look at this source code file, we will see that the next time management related function is:
```C ```C
init_timers(); init_timers();
@ -75,7 +75,7 @@ static void __init init_timer_cpus(void)
} }
``` ```
If you do not know or do not remember what is it a `possible` cpu, you can read the special [part](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) of this book which describes `cpumask` concept in the Linux kernel. In short words, a `possible` processor is a processor which can be plugged in anytime during the life of the system. If you do not know or do not remember what is it a `possible` cpu, you can read the special [part](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2) of this book which describes `cpumask` concept in the Linux kernel. In short words, a `possible` processor is a processor which can be plugged in anytime during the life of the system.
The `init_timer_cpu` function does main work for us, namely it executes initialization of the `tvec_base` structure for each processor. This structure defined in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/time/timer.c) source code file and stores data related to a `dynamic` timer for a certain processor. Let's look on the definition of this structure: The `init_timer_cpu` function does main work for us, namely it executes initialization of the `tvec_base` structure for each processor. This structure defined in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/time/timer.c) source code file and stores data related to a `dynamic` timer for a certain processor. Let's look on the definition of this structure:
@ -136,13 +136,13 @@ static void __init init_timer_cpu(int cpu)
} }
``` ```
The `tvec_bases` represents [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) variable which represents main data structure for a dynamic timer for a given processor. This `per-cpu` variable defined in the same source code file: The `tvec_bases` represents [per-cpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1) variable which represents main data structure for a dynamic timer for a given processor. This `per-cpu` variable defined in the same source code file:
```C ```C
static DEFINE_PER_CPU(struct tvec_base, tvec_bases); static DEFINE_PER_CPU(struct tvec_base, tvec_bases);
``` ```
First of all we're getting the address of the `tvec_bases` for the given processor to `base` variable and as we got it, we are starting to initialize some of the `tvec_base` fields in the `init_timer_cpu` function. After initialization of the `per-cpu` dynamic timers with the [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html) and the number of a possible processor, we need to initialize a `tstats_lookup_lock` [spinlock](https://en.wikipedia.org/wiki/Spinlock) in the `init_timer_stats` function: First of all we're getting the address of the `tvec_bases` for the given processor to `base` variable and as we got it, we are starting to initialize some of the `tvec_base` fields in the `init_timer_cpu` function. After initialization of the `per-cpu` dynamic timers with the [jiffies](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-1) and the number of a possible processor, we need to initialize a `tstats_lookup_lock` [spinlock](https://en.wikipedia.org/wiki/Spinlock) in the `init_timer_stats` function:
```C ```C
void __init init_timer_stats(void) void __init init_timer_stats(void)
@ -240,7 +240,7 @@ The last step in the `init_timers` function is the call of the:
open_softirq(TIMER_SOFTIRQ, run_timer_softirq); open_softirq(TIMER_SOFTIRQ, run_timer_softirq);
``` ```
function. The `open_softirq` function may be already familiar to you if you have read the ninth [part](https://0xax.gitbooks.io/linux-insides/content/Interrupts/linux-interrupts-9.html) about the interrupts and interrupt handling in the Linux kernel. In short words, the `open_softirq` function defined in the [kernel/softirq.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/softirq.c) source code file and executes initialization of the deferred interrupt handler. function. The `open_softirq` function may be already familiar to you if you have read the ninth [part](https://0xax.gitbook.io/linux-insides/summary/interrupts/linux-interrupts-9) about the interrupts and interrupt handling in the Linux kernel. In short words, the `open_softirq` function defined in the [kernel/softirq.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/softirq.c) source code file and executes initialization of the deferred interrupt handler.
In our case the deferred function is the `run_timer_softirq` function that is will be called after a hardware interrupt in the `do_IRQ` function which defined in the [arch/x86/kernel/irq.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/irq.c) source code file. The main point of this function is to handle a software dynamic timer. The Linux kernel does not do this thing during the hardware timer interrupt handling because this is time consuming operation. In our case the deferred function is the `run_timer_softirq` function that is will be called after a hardware interrupt in the `do_IRQ` function which defined in the [arch/x86/kernel/irq.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/irq.c) source code file. The main point of this function is to handle a software dynamic timer. The Linux kernel does not do this thing during the hardware timer interrupt handling because this is time consuming operation.
@ -256,7 +256,7 @@ static void run_timer_softirq(struct softirq_action *h)
} }
``` ```
At the beginning of the `run_timer_softirq` function we get a `dynamic` timer for a current processor and compares the current value of the [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html) with the value of the `timer_jiffies` for the current structure by the call of the `time_after_eq` macro which is defined in the [include/linux/jiffies.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/jiffies.h) header file: At the beginning of the `run_timer_softirq` function we get a `dynamic` timer for a current processor and compares the current value of the [jiffies](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-1) with the value of the `timer_jiffies` for the current structure by the call of the `time_after_eq` macro which is defined in the [include/linux/jiffies.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/jiffies.h) header file:
```C ```C
#define time_after_eq(a,b) \ #define time_after_eq(a,b) \
@ -418,10 +418,10 @@ Links
* [IP](https://en.wikipedia.org/wiki/Internet_Protocol) * [IP](https://en.wikipedia.org/wiki/Internet_Protocol)
* [netfilter](https://en.wikipedia.org/wiki/Netfilter) * [netfilter](https://en.wikipedia.org/wiki/Netfilter)
* [network](https://en.wikipedia.org/wiki/Computer_network) * [network](https://en.wikipedia.org/wiki/Computer_network)
* [cpumask](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) * [cpumask](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2)
* [interrupt](https://en.wikipedia.org/wiki/Interrupt) * [interrupt](https://en.wikipedia.org/wiki/Interrupt)
* [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-1.html) * [jiffies](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-1)
* [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-1.html) * [per-cpu](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-1)
* [spinlock](https://en.wikipedia.org/wiki/Spinlock) * [spinlock](https://en.wikipedia.org/wiki/Spinlock)
* [procfs](https://en.wikipedia.org/wiki/Procfs) * [procfs](https://en.wikipedia.org/wiki/Procfs)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-3.html) * [previous part](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-3)

View File

@ -4,7 +4,7 @@ Timers and time management in the Linux kernel. Part 5.
Introduction to the `clockevents` framework Introduction to the `clockevents` framework
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is fifth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. As you might noted from the title of this part, the `clockevents` framework will be discussed. We already saw one framework in the [second](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-2.html) part of this chapter. It was `clocksource` framework. Both of these frameworks represent timekeeping abstractions in the Linux kernel. This is fifth part of the [chapter](https://0xax.gitbook.io/linux-insides/summary/timers/) which describes timers and time management related stuff in the Linux kernel. As you might noted from the title of this part, the `clockevents` framework will be discussed. We already saw one framework in the [second](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-2) part of this chapter. It was `clocksource` framework. Both of these frameworks represent timekeeping abstractions in the Linux kernel.
At first let's refresh your memory and try to remember what is it `clocksource` framework and and what its purpose. The main goal of the `clocksource` framework is to provide `timeline`. As described in the [documentation](https://github.com/0xAX/linux/blob/0a07b238e5f488b459b6113a62e06b6aab017f71/Documentation/timers/timekeeping.txt): At first let's refresh your memory and try to remember what is it `clocksource` framework and and what its purpose. The main goal of the `clocksource` framework is to provide `timeline`. As described in the [documentation](https://github.com/0xAX/linux/blob/0a07b238e5f488b459b6113a62e06b6aab017f71/Documentation/timers/timekeeping.txt):
@ -130,7 +130,7 @@ The next two fields `shift` and `mult` are familiar to us. They will be used to
#define cpumask_of(cpu) (get_cpu_mask(cpu)) #define cpumask_of(cpu) (get_cpu_mask(cpu))
``` ```
Where the `get_cpu_mask` returns the cpumask containing just a given `cpu` number. More about `cpumasks` concept you may read in the [CPU masks in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) part. In the last four lines of code we set callbacks for the clock event device suspend/resume, device shutdown and update of the clock event device state. Where the `get_cpu_mask` returns the cpumask containing just a given `cpu` number. More about `cpumasks` concept you may read in the [CPU masks in the Linux kernel](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2) part. In the last four lines of code we set callbacks for the clock event device suspend/resume, device shutdown and update of the clock event device state.
After we finished with the initialization of the `at91sam926x` periodic timer, we can register it by the call of the following functions: After we finished with the initialization of the `at91sam926x` periodic timer, we can register it by the call of the following functions:
@ -185,7 +185,7 @@ if (!dev->cpumask) {
} }
``` ```
Remember that we have set the `cpumask` of the `at91sam926x` periodic timer to first processor. If the `cpumask` field is zero, we check the number of possible processors in the system and print warning message if it is less than on. Additionally we set the `cpumask` of the given clock event device to the current processor. If you are interested in how the `smp_processor_id` macro is implemented, you can read more about it in the fourth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process chapter. Remember that we have set the `cpumask` of the `at91sam926x` periodic timer to first processor. If the `cpumask` field is zero, we check the number of possible processors in the system and print warning message if it is less than on. Additionally we set the `cpumask` of the given clock event device to the current processor. If you are interested in how the `smp_processor_id` macro is implemented, you can read more about it in the fourth [part](https://0xax.gitbook.io/linux-insides/summary/initialization/linux-initialization-4) of the Linux kernel initialization process chapter.
After this check we lock the actual code of the clock event device registration by the call following macros: After this check we lock the actual code of the clock event device registration by the call following macros:
@ -385,7 +385,7 @@ That's all.
Conclusion Conclusion
------------------------------------------------------------------------------- -------------------------------------------------------------------------------
This is the end of the fifth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and timer management related stuff in the Linux kernel. In the previous part got acquainted with the `timers` concept. In this part we continued to learn time management related stuff in the Linux kernel and saw a little about yet another framework - `clockevents`. This is the end of the fifth part of the [chapter](https://0xax.gitbook.io/linux-insides/summary/timers/) that describes timers and timer management related stuff in the Linux kernel. In the previous part got acquainted with the `timers` concept. In this part we continued to learn time management related stuff in the Linux kernel and saw a little about yet another framework - `clockevents`.
If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new). If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
@ -409,7 +409,7 @@ Links
* [local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) * [local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
* [C3 state](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface#Device_states) * [C3 state](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface#Device_states)
* [Periodic Interval Timer (PIT) for at91sam926x](http://www.atmel.com/Images/doc6062.pdf) * [Periodic Interval Timer (PIT) for at91sam926x](http://www.atmel.com/Images/doc6062.pdf)
* [CPU masks in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Concepts/linux-cpu-2.html) * [CPU masks in the Linux kernel](https://0xax.gitbook.io/linux-insides/summary/concepts/linux-cpu-2)
* [deadlock](https://en.wikipedia.org/wiki/Deadlock) * [deadlock](https://en.wikipedia.org/wiki/Deadlock)
* [CPU hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) * [CPU hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-3.html) * [previous part](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-3)

View File

@ -4,7 +4,7 @@ Timers and time management in the Linux kernel. Part 6.
x86_64 related clock sources x86_64 related clock sources
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is sixth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-5.html) we saw `clockevents` framework and now we will continue to dive into time management related stuff in the Linux kernel. This part will describe implementation of [x86](https://en.wikipedia.org/wiki/X86) architecture related clock sources (more about `clocksource` concept you can read in the [second part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-2.html) of this chapter). This is sixth part of the [chapter](https://0xax.gitbook.io/linux-insides/summary/timers/) which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-5) we saw `clockevents` framework and now we will continue to dive into time management related stuff in the Linux kernel. This part will describe implementation of [x86](https://en.wikipedia.org/wiki/X86) architecture related clock sources (more about `clocksource` concept you can read in the [second part](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-2) of this chapter).
First of all we must know what clock sources may be used at `x86` architecture. It is easy to know from the [sysfs](https://en.wikipedia.org/wiki/Sysfs) or from content of the `/sys/devices/system/clocksource/clocksource0/available_clocksource`. The `/sys/devices/system/clocksource/clocksourceN` provides two special files to achieve this: First of all we must know what clock sources may be used at `x86` architecture. It is easy to know from the [sysfs](https://en.wikipedia.org/wiki/Sysfs) or from content of the `/sys/devices/system/clocksource/clocksource0/available_clocksource`. The `/sys/devices/system/clocksource/clocksourceN` provides two special files to achieve this:
@ -31,7 +31,7 @@ $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc tsc
``` ```
For me it is [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter). As we may know from the [second part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-2.html) of this chapter, which describes internals of the `clocksource` framework in the Linux kernel, the best clock source in a system is a clock source with the best (highest) rating or in other words with the highest [frequency](https://en.wikipedia.org/wiki/Frequency). For me it is [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter). As we may know from the [second part](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-2) of this chapter, which describes internals of the `clocksource` framework in the Linux kernel, the best clock source in a system is a clock source with the best (highest) rating or in other words with the highest [frequency](https://en.wikipedia.org/wiki/Frequency).
Frequency of the [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) power management timer is `3.579545 MHz`. Frequency of the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) is at least `10 MHz`. And the frequency of the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) depends on processor. For example On older processors, the `Time Stamp Counter` was counting internal processor clock cycles. This means its frequency changed when the processor's frequency scaling changed. The situation has changed for newer processors. Newer processors have an `invariant Time Stamp counter` that increments at a constant rate in all operational states of processor. Actually we can get its frequency in the output of the `/proc/cpuinfo`. For example for the first processor in the system: Frequency of the [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) power management timer is `3.579545 MHz`. Frequency of the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) is at least `10 MHz`. And the frequency of the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) depends on processor. For example On older processors, the `Time Stamp Counter` was counting internal processor clock cycles. This means its frequency changed when the processor's frequency scaling changed. The situation has changed for newer processors. Newer processors have an `invariant Time Stamp counter` that increments at a constant rate in all operational states of processor. Actually we can get its frequency in the output of the `/proc/cpuinfo`. For example for the first processor in the system:
@ -388,7 +388,7 @@ That's all.
Conclusion Conclusion
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the end of the sixth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and timer management related stuff in the Linux kernel. In the previous part got acquainted with the `clockevents` framework. In this part we continued to learn time management related stuff in the Linux kernel and saw a little about three different clock sources which are used in the [x86](https://en.wikipedia.org/wiki/X86) architecture. The next part will be last part of this [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) and we will see some user space related stuff, i.e. how some time related [system calls](https://en.wikipedia.org/wiki/System_call) implemented in the Linux kernel. This is the end of the sixth part of the [chapter](https://0xax.gitbook.io/linux-insides/summary/timers/) that describes timers and timer management related stuff in the Linux kernel. In the previous part got acquainted with the `clockevents` framework. In this part we continued to learn time management related stuff in the Linux kernel and saw a little about three different clock sources which are used in the [x86](https://en.wikipedia.org/wiki/X86) architecture. The next part will be last part of this [chapter](https://0xax.gitbook.io/linux-insides/summary/timers/) and we will see some user space related stuff, i.e. how some time related [system calls](https://en.wikipedia.org/wiki/System_call) implemented in the Linux kernel.
If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new). If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
@ -410,4 +410,4 @@ Links
* [IRQ0](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29#Master_PIC) * [IRQ0](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29#Master_PIC)
* [i8259](https://en.wikipedia.org/wiki/Intel_8259) * [i8259](https://en.wikipedia.org/wiki/Intel_8259)
* [initcall](https://kernelnewbies.org/Documents/InitcallMechanism) * [initcall](https://kernelnewbies.org/Documents/InitcallMechanism)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-5.html) * [previous part](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-5)

View File

@ -4,7 +4,7 @@ Timers and time management in the Linux kernel. Part 7.
Time related system calls in the Linux kernel Time related system calls in the Linux kernel
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the seventh and last part [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html), which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-6.html), we discussed timers in the context of [x86_64](https://en.wikipedia.org/wiki/X86-64): [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) and [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter). Internal time management is an interesting part of the Linux kernel, but of course not only the kernel needs the `time` concept. Our programs also need to know time. In this part, we will consider implementation of some time management related [system calls](https://en.wikipedia.org/wiki/System_call). These system calls are: This is the seventh and last part [chapter](https://0xax.gitbook.io/linux-insides/summary/timers/), which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-6), we discussed timers in the context of [x86_64](https://en.wikipedia.org/wiki/X86-64): [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) and [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter). Internal time management is an interesting part of the Linux kernel, but of course not only the kernel needs the `time` concept. Our programs also need to know time. In this part, we will consider implementation of some time management related [system calls](https://en.wikipedia.org/wiki/System_call). These system calls are:
* `clock_gettime`; * `clock_gettime`;
* `gettimeofday`; * `gettimeofday`;
@ -12,7 +12,7 @@ This is the seventh and last part [chapter](https://0xax.gitbooks.io/linux-insid
We will start from a simple userspace [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) program and see all way from the call of the [standard library](https://en.wikipedia.org/wiki/Standard_library) function to the implementation of certain system calls. As each [architecture](https://github.com/torvalds/linux/tree/master/arch) provides its own implementation of certain system calls, we will consider only [x86_64](https://en.wikipedia.org/wiki/X86-64) specific implementations of system calls, as this book is related to this architecture. We will start from a simple userspace [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) program and see all way from the call of the [standard library](https://en.wikipedia.org/wiki/Standard_library) function to the implementation of certain system calls. As each [architecture](https://github.com/torvalds/linux/tree/master/arch) provides its own implementation of certain system calls, we will consider only [x86_64](https://en.wikipedia.org/wiki/X86-64) specific implementations of system calls, as this book is related to this architecture.
Additionally, we will not consider the concept of system calls in this part, but only implementations of these three system calls in the Linux kernel. If you are interested in what is a `system call`, there is a special [chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) about this. Additionally, we will not consider the concept of system calls in this part, but only implementations of these three system calls in the Linux kernel. If you are interested in what is a `system call`, there is a special [chapter](https://0xax.gitbook.io/linux-insides/summary/syscall) about this.
So, let's start from the `gettimeofday` system call. So, let's start from the `gettimeofday` system call.
@ -57,7 +57,7 @@ The second parameter of the `gettimeofday` function is a pointer to the `timezon
Current date/time: 03-26-2016/16:42:02 Current date/time: 03-26-2016/16:42:02
``` ```
As you may already know, a userspace application does not call a system call directly from the kernel space. Before the actual system call entry will be called, we call a function from the standard library. In my case it is [glibc](https://en.wikipedia.org/wiki/GNU_C_Library), so I will consider this case. The implementation of the `gettimeofday` function is located in the [sysdeps/unix/sysv/linux/x86/gettimeofday.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86/gettimeofday.c;h=36f7c26ffb0e818709d032c605fec8c4bd22a14e;hb=HEAD) source code file. As you already may know, the `gettimeofday` is not a usual system call. It is located in the special area which is called `vDSO` (you can read more about it in the [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-3.html), which describes this concept). As you may already know, a userspace application does not call a system call directly from the kernel space. Before the actual system call entry will be called, we call a function from the standard library. In my case it is [glibc](https://en.wikipedia.org/wiki/GNU_C_Library), so I will consider this case. The implementation of the `gettimeofday` function is located in the [sysdeps/unix/sysv/linux/x86/gettimeofday.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86/gettimeofday.c;h=36f7c26ffb0e818709d032c605fec8c4bd22a14e;hb=HEAD) source code file. As you already may know, the `gettimeofday` is not a usual system call. It is located in the special area which is called `vDSO` (you can read more about it in the [part](https://0xax.gitbook.io/linux-insides/summary/syscall/linux-syscall-3), which describes this concept).
The `glibc` implementation of `gettimeofday` tries to resolve the given symbol; in our case this symbol is `__vdso_gettimeofday` by the call of the `_dl_vdso_vsym` internal function. If the symbol cannot be resolved, it returns `NULL` and we fallback to the call of the usual system call: The `glibc` implementation of `gettimeofday` tries to resolve the given symbol; in our case this symbol is `__vdso_gettimeofday` by the call of the `_dl_vdso_vsym` internal function. If the symbol cannot be resolved, it returns `NULL` and we fallback to the call of the usual system call:
@ -356,7 +356,7 @@ SYSCALL_DEFINE2(nanosleep, struct timespec __user *, rqtp,
} }
``` ```
More about the `SYSCALL_DEFINE2` macro you may read in the [chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) about system calls. If we look at the implementation of the `nanosleep` system call, first of all we will see that it starts from the call of the `copy_from_user` function. This function copies the given data from the userspace to kernelspace. In our case we copy timeout value to sleep to the kernelspace `timespec` structure and check that the given `timespec` is valid by the call of the `timesc_valid` function: More about the `SYSCALL_DEFINE2` macro you may read in the [chapter](https://0xax.gitbook.io/linux-insides/summary/syscall) about system calls. If we look at the implementation of the `nanosleep` system call, first of all we will see that it starts from the call of the `copy_from_user` function. This function copies the given data from the userspace to kernelspace. In our case we copy timeout value to sleep to the kernelspace `timespec` structure and check that the given `timespec` is valid by the call of the `timesc_valid` function:
```C ```C
static inline bool timespec_valid(const struct timespec *ts) static inline bool timespec_valid(const struct timespec *ts)
@ -369,7 +369,7 @@ static inline bool timespec_valid(const struct timespec *ts)
} }
``` ```
which just checks that the given `timespec` does not represent date before `1970` and nanoseconds does not overflow `1` second. The `nanosleep` function ends with the call of the `hrtimer_nanosleep` function from the same source code file. The `hrtimer_nanosleep` function creates a [timer](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-4.html) and calls the `do_nanosleep` function. The `do_nanosleep` does main job for us. This function provides loop: which just checks that the given `timespec` does not represent date before `1970` and nanoseconds does not overflow `1` second. The `nanosleep` function ends with the call of the `hrtimer_nanosleep` function from the same source code file. The `hrtimer_nanosleep` function creates a [timer](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-4) and calls the `do_nanosleep` function. The `do_nanosleep` does main job for us. This function provides loop:
```C ```C
do { do {
@ -392,7 +392,7 @@ That's all.
Conclusion Conclusion
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the end of the seventh part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and timer management related stuff in the Linux kernel. In the previous part we saw [x86_64](https://en.wikipedia.org/wiki/X86-64) specific clock sources. As I wrote in the beginning, this part is the last part of this chapter. We saw important time management related concepts like `clocksource` and `clockevents` frameworks, `jiffies` counter and etc., in this chpater. Of course this does not cover all of the time management in the Linux kernel. Many parts of this mostly related to the scheduling which we will see in other chapter. This is the end of the seventh part of the [chapter](https://0xax.gitbook.io/linux-insides/summary/timers/) that describes timers and timer management related stuff in the Linux kernel. In the previous part we saw [x86_64](https://en.wikipedia.org/wiki/X86-64) specific clock sources. As I wrote in the beginning, this part is the last part of this chapter. We saw important time management related concepts like `clocksource` and `clockevents` frameworks, `jiffies` counter and etc., in this chpater. Of course this does not cover all of the time management in the Linux kernel. Many parts of this mostly related to the scheduling which we will see in other chapter.
If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new). If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
@ -412,10 +412,10 @@ Links
* [register](https://en.wikipedia.org/wiki/Processor_register) * [register](https://en.wikipedia.org/wiki/Processor_register)
* [System V Application Binary Interface](http://www.x86-64.org/documentation/abi.pdf) * [System V Application Binary Interface](http://www.x86-64.org/documentation/abi.pdf)
* [context switch](https://en.wikipedia.org/wiki/Context_switch) * [context switch](https://en.wikipedia.org/wiki/Context_switch)
* [Introduction to timers in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-4.html) * [Introduction to timers in the Linux kernel](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-4)
* [uptime](https://en.wikipedia.org/wiki/Uptime#Using_uptime) * [uptime](https://en.wikipedia.org/wiki/Uptime#Using_uptime)
* [system calls table for x86_64](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/syscalls/syscall_64.tbl) * [system calls table for x86_64](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/syscalls/syscall_64.tbl)
* [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) * [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
* [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) * [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
* [x86_64](https://en.wikipedia.org/wiki/X86-64) * [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/linux-timers-6.html) * [previous part](https://0xax.gitbook.io/linux-insides/summary/timers/linux-timers-6)