commit
b10d0d80c1
@ -1,10 +1,10 @@
|
||||
# Kernel boot process
|
||||
# Kernel Boot Process
|
||||
|
||||
This chapter describes the linux kernel boot process. You will see here a
|
||||
couple of posts which describe the full cycle of the kernel loading process:
|
||||
This chapter describes the linux kernel boot process. Here you will see a
|
||||
couple of posts which describes the full cycle of the kernel loading process:
|
||||
|
||||
* [From the bootloader to kernel](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-1.html) - describes all stages from turning on the computer to before the first instruction of the kernel;
|
||||
* [First steps in the kernel setup code](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) - describes first steps in the kernel setup code. You will see heap initialization, querying of different parameters like EDD, IST and etc...
|
||||
* [From the bootloader to kernel](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-1.html) - describes all stages from turning on the computer to running the first instruction of the kernel.
|
||||
* [First steps in the kernel setup code](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) - describes first steps in the kernel setup code. You will see heap initialization, query of different parameters like EDD, IST and etc...
|
||||
* [Video mode initialization and transition to protected mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) - describes video mode initialization in the kernel setup code and transition to protected mode.
|
||||
* [Transition to 64-bit mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-4.html) - describes preparation for transition into 64-bit mode and transition into it.
|
||||
* [Kernel Decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) - describes preparation before kernel decompression and directly decompression.
|
||||
* [Transition to 64-bit mode](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-4.html) - describes preparation for transition into 64-bit mode and details of transition.
|
||||
* [Kernel Decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) - describes preparation before kernel decompression and details of direct decompression.
|
||||
|
@ -0,0 +1,395 @@
|
||||
The initcall mechanism
|
||||
================================================================================
|
||||
|
||||
Introduction
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
As you may understand from the title, this part will cover an interesting and important concept in the Linux kernel which is called - `initcall`. We already saw definitions like these:
|
||||
|
||||
```C
|
||||
early_param("debug", debug_kernel);
|
||||
```
|
||||
|
||||
or
|
||||
|
||||
```C
|
||||
arch_initcall(init_pit_clocksource);
|
||||
```
|
||||
|
||||
in some parts of the Linux kernel. Before we see how this mechanism is implemented in the Linux kernel, we must know actually what is it and how the Linux kernel uses it. Definitions like these represent a [callback](https://en.wikipedia.org/wiki/Callback_%28computer_programming%29) function which will be called either during initialization of the Linux kernel or right after it. Actually the main point of the `initcall` mechanism is to determine correct order of the built-in modules and subsystems initialization. For example let's look at the following function:
|
||||
|
||||
```C
|
||||
static int __init nmi_warning_debugfs(void)
|
||||
{
|
||||
debugfs_create_u64("nmi_longest_ns", 0644,
|
||||
arch_debugfs_dir, &nmi_longest_ns);
|
||||
return 0;
|
||||
}
|
||||
```
|
||||
|
||||
from the [arch/x86/kernel/nmi.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/nmi.c) source code file. As we may see it just creates the `nmi_longest_ns` [debugfs](https://en.wikipedia.org/wiki/Debugfs) file in the `arch_debugfs_dir` directory. Actually, this `debugfs` file may be created only after the `arch_debugfs_dir` will be created. Creation of this directory occurs during the architecture-specific initialization of the Linux kernel. Actually this directory will be created in the `arch_kdebugfs_init` function from the [arch/x86/kernel/kdebugfs.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/kdebugfs.c) source code file. Note that the `arch_kdebugfs_init` function is marked as `initcall` too:
|
||||
|
||||
```C
|
||||
arch_initcall(arch_kdebugfs_init);
|
||||
```
|
||||
|
||||
The Linux kernel calls all architecture-specific `initcalls` before the `fs` related `initcalls`. So, our `nmi_longest_ns` file will be created only after the `arch_kdebugfs_dir` directory will be created. Actually, the Linux kernel provides eight levels of main `initcalls`:
|
||||
|
||||
* `early`;
|
||||
* `core`;
|
||||
* `postcore`;
|
||||
* `arch`;
|
||||
* `susys`;
|
||||
* `fs`;
|
||||
* `device`;
|
||||
* `late`.
|
||||
|
||||
All of their names are represented by the `initcall_level_names` array which is defined in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file:
|
||||
|
||||
```C
|
||||
static char *initcall_level_names[] __initdata = {
|
||||
"early",
|
||||
"core",
|
||||
"postcore",
|
||||
"arch",
|
||||
"subsys",
|
||||
"fs",
|
||||
"device",
|
||||
"late",
|
||||
};
|
||||
```
|
||||
|
||||
All functions which are marked as `initcall` by these identifiers, will be called in the same order or at first `early initcalls` will be called, at second `core initcalls` and etc. From this moment we know a little about `initcall` mechanism, so we can start to dive into the source code of the Linux kernel to see how this mechanism is implemented.
|
||||
|
||||
Implementation initcall mechanism in the Linux kernel
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The Linux kernel provides a set of macros from the [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h) header file to mark a given function as `initcall`. All of these macros are pretty simple:
|
||||
|
||||
```C
|
||||
#define early_initcall(fn) __define_initcall(fn, early)
|
||||
#define core_initcall(fn) __define_initcall(fn, 1)
|
||||
#define postcore_initcall(fn) __define_initcall(fn, 2)
|
||||
#define arch_initcall(fn) __define_initcall(fn, 3)
|
||||
#define subsys_initcall(fn) __define_initcall(fn, 4)
|
||||
#define fs_initcall(fn) __define_initcall(fn, 5)
|
||||
#define device_initcall(fn) __define_initcall(fn, 6)
|
||||
#define late_initcall(fn) __define_initcall(fn, 7)
|
||||
```
|
||||
|
||||
and as we may see these macros just expand to the call of the `__define_initcall` macro from the same header file. Moreover, the `__define_initcall` macro takes two arguments:
|
||||
|
||||
* `fn` - callback function which will be called during call of `initcalls` of the certain level;
|
||||
* `id` - identifier to identify `initcall` to prevent error when two the same `initcalls` point to the same handler.
|
||||
|
||||
The implementation of the `__define_initcall` macro looks like:
|
||||
|
||||
```C
|
||||
#define __define_initcall(fn, id) \
|
||||
static initcall_t __initcall_##fn##id __used \
|
||||
__attribute__((__section__(".initcall" #id ".init"))) = fn; \
|
||||
LTO_REFERENCE_INITCALL(__initcall_##fn##id)
|
||||
```
|
||||
|
||||
To understand the `__define_initcall` macro, first of all let's look at the `initcall_t` type. This type is defined in the same [header]() file and it represents pointer to a function which returns pointer to [integer](https://en.wikipedia.org/wiki/Integer) which will be result of the `initcall`:
|
||||
|
||||
```C
|
||||
typedef int (*initcall_t)(void);
|
||||
```
|
||||
|
||||
Now let's return to the `_-define_initcall` macro. The [##](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html) provides ability to concatenate two symbols. In our case, the first line of the `__define_initcall` macro produces definition of the given function which is located in the `.initcall id .init` [ELF section](http://www.skyfree.org/linux/references/ELF_Format.pdf) and marked with the following [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) attributes: `__initcall_function_name_id` and `__used`. If we will look in the [include/asm-generic/vmlinux.lds.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/vmlinux.lds.h) header file which represents data for the kernel [linker](https://en.wikipedia.org/wiki/Linker_%28computing%29) script, we will see that all of `initcalls` sections will be placed in the `.data` section:
|
||||
|
||||
```C
|
||||
#define INIT_CALLS \
|
||||
VMLINUX_SYMBOL(__initcall_start) = .; \
|
||||
*(.initcallearly.init) \
|
||||
INIT_CALLS_LEVEL(0) \
|
||||
INIT_CALLS_LEVEL(1) \
|
||||
INIT_CALLS_LEVEL(2) \
|
||||
INIT_CALLS_LEVEL(3) \
|
||||
INIT_CALLS_LEVEL(4) \
|
||||
INIT_CALLS_LEVEL(5) \
|
||||
INIT_CALLS_LEVEL(rootfs) \
|
||||
INIT_CALLS_LEVEL(6) \
|
||||
INIT_CALLS_LEVEL(7) \
|
||||
VMLINUX_SYMBOL(__initcall_end) = .;
|
||||
|
||||
#define INIT_DATA_SECTION(initsetup_align) \
|
||||
.init.data : AT(ADDR(.init.data) - LOAD_OFFSET) { \
|
||||
... \
|
||||
INIT_CALLS \
|
||||
... \
|
||||
}
|
||||
|
||||
```
|
||||
|
||||
The second attribute - `__used` is defined in the [include/linux/compiler-gcc.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler-gcc.h) header file and it expands to the definition of the following `gcc` attribute:
|
||||
|
||||
```C
|
||||
#define __used __attribute__((__used__))
|
||||
```
|
||||
|
||||
which prevents `variable defined but not used` warning. The last line of the `__define_initcall` macro is:
|
||||
|
||||
```C
|
||||
LTO_REFERENCE_INITCALL(__initcall_##fn##id)
|
||||
```
|
||||
|
||||
depends on the `CONFIG_LTO` kernel configuration option and just provides stub for the compiler [Link time optimization](https://gcc.gnu.org/wiki/LinkTimeOptimization):
|
||||
|
||||
```
|
||||
#ifdef CONFIG_LTO
|
||||
#define LTO_REFERENCE_INITCALL(x) \
|
||||
static __used __exit void *reference_##x(void) \
|
||||
{ \
|
||||
return &x; \
|
||||
}
|
||||
#else
|
||||
#define LTO_REFERENCE_INITCALL(x)
|
||||
#endif
|
||||
```
|
||||
|
||||
In order to prevent any problem when there is no reference to a variable in a module, it will be moved to the end of the program. That's all about the `__define_initcall` macro. So, all of the `*_initcall` macros will be expanded during compilation of the Linux kernel, and all `initcalls` will be placed in their sections and all of them will be available from the `.data` section and the Linux kernel will know where to find a certain `initcall` to call it during initialization process.
|
||||
|
||||
As `initcalls` can be called by the Linux kernel, let's look how the Linux kernel does this. This process starts in the `do_basic_setup` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file:
|
||||
|
||||
```C
|
||||
static void __init do_basic_setup(void)
|
||||
{
|
||||
...
|
||||
...
|
||||
...
|
||||
do_initcalls();
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
which is called during the initialization of the Linux kernel, right after main steps of initialization like memory manager related initialization, `CPU` subsystem and other already finished. The `do_initcalls` function just goes through the array of `initcall` levels and call the `do_initcall_level` function for each level:
|
||||
|
||||
```C
|
||||
static void __init do_initcalls(void)
|
||||
{
|
||||
int level;
|
||||
|
||||
for (level = 0; level < ARRAY_SIZE(initcall_levels) - 1; level++)
|
||||
do_initcall_level(level);
|
||||
}
|
||||
```
|
||||
|
||||
The `initcall_levels` array is defined in the same source code [file](https://github.com/torvalds/linux/blob/master/init/main.c) and contains pointers to the sections which were defined in the `__define_initcall` macro:
|
||||
|
||||
```C
|
||||
static initcall_t *initcall_levels[] __initdata = {
|
||||
__initcall0_start,
|
||||
__initcall1_start,
|
||||
__initcall2_start,
|
||||
__initcall3_start,
|
||||
__initcall4_start,
|
||||
__initcall5_start,
|
||||
__initcall6_start,
|
||||
__initcall7_start,
|
||||
__initcall_end,
|
||||
};
|
||||
```
|
||||
|
||||
If you are interested, you can find these sections in the `arch/x86/kernel/vmlinux.lds` linker script which is generated after the Linux kernel compilation:
|
||||
|
||||
```
|
||||
.init.data : AT(ADDR(.init.data) - 0xffffffff80000000) {
|
||||
...
|
||||
...
|
||||
...
|
||||
...
|
||||
__initcall_start = .;
|
||||
*(.initcallearly.init)
|
||||
__initcall0_start = .;
|
||||
*(.initcall0.init)
|
||||
*(.initcall0s.init)
|
||||
__initcall1_start = .;
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
If you are not familiar with this then you can know more about [linkers](https://en.wikipedia.org/wiki/Linker_%28computing%29) in the special [part](https://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html) of this book.
|
||||
|
||||
As we just saw, the `do_initcall_level` function takes one parameter - level of `initcall` and does following two things: First of all this function parses the `initcall_command_line` which is copy of usual kernel [command line](https://www.kernel.org/doc/Documentation/kernel-parameters.txt) which may contain parameters for modules with the `parse_args` function from the [kernel/params.c](https://github.com/torvalds/linux/blob/master/kernel/params.c) source code file and call the `do_on_initcall` function for each level:
|
||||
|
||||
```C
|
||||
for (fn = initcall_levels[level]; fn < initcall_levels[level+1]; fn++)
|
||||
do_one_initcall(*fn);
|
||||
```
|
||||
|
||||
The `do_on_initcall` does main job for us. As we may see, this function takes one parameter which represent `initcall` callback function and does the call of the given callback:
|
||||
|
||||
```C
|
||||
int __init_or_module do_one_initcall(initcall_t fn)
|
||||
{
|
||||
int count = preempt_count();
|
||||
int ret;
|
||||
char msgbuf[64];
|
||||
|
||||
if (initcall_blacklisted(fn))
|
||||
return -EPERM;
|
||||
|
||||
if (initcall_debug)
|
||||
ret = do_one_initcall_debug(fn);
|
||||
else
|
||||
ret = fn();
|
||||
|
||||
msgbuf[0] = 0;
|
||||
|
||||
if (preempt_count() != count) {
|
||||
sprintf(msgbuf, "preemption imbalance ");
|
||||
preempt_count_set(count);
|
||||
}
|
||||
if (irqs_disabled()) {
|
||||
strlcat(msgbuf, "disabled interrupts ", sizeof(msgbuf));
|
||||
local_irq_enable();
|
||||
}
|
||||
WARN(msgbuf[0], "initcall %pF returned with %s\n", fn, msgbuf);
|
||||
|
||||
return ret;
|
||||
}
|
||||
```
|
||||
|
||||
Let's try to understand what does the `do_on_initcall` function does. First of all we increase [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) counter so that we can check it later to be sure that it is not imbalanced. After this step we can see the call of the `initcall_backlist` function which
|
||||
goes over the `blacklisted_initcalls` list which stores blacklisted `initcalls` and releases the given `initcall` if it is located in this list:
|
||||
|
||||
```C
|
||||
list_for_each_entry(entry, &blacklisted_initcalls, next) {
|
||||
if (!strcmp(fn_name, entry->buf)) {
|
||||
pr_debug("initcall %s blacklisted\n", fn_name);
|
||||
kfree(fn_name);
|
||||
return true;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
The blacklisted `initcalls` stored in the `blacklisted_initcalls` list and this list is filled during early Linux kernel initialization from the Linux kernel command line.
|
||||
|
||||
After the blacklisted `initcalls` will be handled, the next part of code does directly the call of the `initcall`:
|
||||
|
||||
```C
|
||||
if (initcall_debug)
|
||||
ret = do_one_initcall_debug(fn);
|
||||
else
|
||||
ret = fn();
|
||||
```
|
||||
|
||||
Depends on the value of the `initcall_debug` variable, the `do_one_initcall_debug` function will call `initcall` or this function will do it directly via `fn()`. The `initcall_debug` variable is defined in the [same](https://github.com/torvalds/linux/blob/master/init/main.c) source code file:
|
||||
|
||||
```C
|
||||
bool initcall_debug;
|
||||
```
|
||||
|
||||
and provides ability to print some information to the kernel [log buffer](https://en.wikipedia.org/wiki/Dmesg). The value of the variable can be set from the kernel commands via the `initcall_debug` parameter. As we can read from the [documentation](https://www.kernel.org/doc/Documentation/kernel-parameters.txt) of the Linux kernel command line:
|
||||
|
||||
```
|
||||
initcall_debug [KNL] Trace initcalls as they are executed. Useful
|
||||
for working out where the kernel is dying during
|
||||
startup.
|
||||
```
|
||||
|
||||
And that's true. If we will look at the implementation of the `do_one_initcall_debug` function, we will see that it does the same as the `do_one_initcall` function or i.e. the `do_one_initcall_debug` function calls the given `initcall` and prints some information (like the [pid](https://en.wikipedia.org/wiki/Process_identifier) of the currently running task, duration of execution of the `initcall` and etc.) related to the execution of the given `initcall`:
|
||||
|
||||
```C
|
||||
static int __init_or_module do_one_initcall_debug(initcall_t fn)
|
||||
{
|
||||
ktime_t calltime, delta, rettime;
|
||||
unsigned long long duration;
|
||||
int ret;
|
||||
|
||||
printk(KERN_DEBUG "calling %pF @ %i\n", fn, task_pid_nr(current));
|
||||
calltime = ktime_get();
|
||||
ret = fn();
|
||||
rettime = ktime_get();
|
||||
delta = ktime_sub(rettime, calltime);
|
||||
duration = (unsigned long long) ktime_to_ns(delta) >> 10;
|
||||
printk(KERN_DEBUG "initcall %pF returned %d after %lld usecs\n",
|
||||
fn, ret, duration);
|
||||
|
||||
return ret;
|
||||
}
|
||||
```
|
||||
|
||||
As an `initcall` was called by the one of the ` do_one_initcall` or `do_one_initcall_debug` functions, we may see two checks in the end of the `do_one_initcall` function. The first one checks the amount of possible `__preempt_count_add` and `__preempt_count_sub` calls inside of the executed initcall, and if this value is not equal to the previous value of the preemptible counter, we add the `preemption imbalance` string to the message buffer and set correct value of the preemptible counter:
|
||||
|
||||
```C
|
||||
if (preempt_count() != count) {
|
||||
sprintf(msgbuf, "preemption imbalance ");
|
||||
preempt_count_set(count);
|
||||
}
|
||||
```
|
||||
|
||||
Later this error string will be printed. The last check the state of local [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) and if they are disabled, we add the `disabled interrupts` strings to the our message buffer and enable `IRQs` for the current processor to prevent the state when `IRQs` were disabled by an `initcall` and didn't enable again:
|
||||
|
||||
```C
|
||||
if (irqs_disabled()) {
|
||||
strlcat(msgbuf, "disabled interrupts ", sizeof(msgbuf));
|
||||
local_irq_enable();
|
||||
}
|
||||
```
|
||||
|
||||
That's all. In this way the Linux kernel does initialization of many subsystems in a correct order. From now on, we know what is the `initcall` mechanism in the Linux kernel. In this part, we covered main general portion of the `initcall` mechanism but we left some important concepts. Let's make a short look at these concepts.
|
||||
|
||||
First of all, we have missed one level of `initcalls`, this is `rootfs initcalls`. You can find definition of the `rootfs_initcall` in the [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h) header file along with all similar macros which we saw in this part:
|
||||
|
||||
```C
|
||||
#define rootfs_initcall(fn) __define_initcall(fn, rootfs)
|
||||
```
|
||||
|
||||
As we may understand from the macro's name, its main purpose is to store callbacks which are related to the [rootfs](https://en.wikipedia.org/wiki/Initramfs). Besides this goal, it may be useful to initialize other stuffs after initialization related to filesystems level only if devices related stuff are not initialized. For example, the decompression of the [initramfs](https://en.wikipedia.org/wiki/Initramfs) which occurred in the `populate_rootfs` function from the [init/initramfs.c](https://github.com/torvalds/linux/blob/master/init/initramfs.c) source code file:
|
||||
|
||||
```C
|
||||
rootfs_initcall(populate_rootfs);
|
||||
```
|
||||
|
||||
From this place, we may see familiar output:
|
||||
|
||||
```
|
||||
[ 0.199960] Unpacking initramfs...
|
||||
```
|
||||
|
||||
Besides the `rootfs_initcall` level, there are additional `console_initcall`, `security_initcall` and other secondary `initcall` levels. The last thing that we have missed is the set of the `*_initcall_sync` levels. Almost each `*_initcall` macro that we have seen in this part, has macro companion with the `_sync` prefix:
|
||||
|
||||
```C
|
||||
#define core_initcall_sync(fn) __define_initcall(fn, 1s)
|
||||
#define postcore_initcall_sync(fn) __define_initcall(fn, 2s)
|
||||
#define arch_initcall_sync(fn) __define_initcall(fn, 3s)
|
||||
#define subsys_initcall_sync(fn) __define_initcall(fn, 4s)
|
||||
#define fs_initcall_sync(fn) __define_initcall(fn, 5s)
|
||||
#define device_initcall_sync(fn) __define_initcall(fn, 6s)
|
||||
#define late_initcall_sync(fn) __define_initcall(fn, 7s)
|
||||
```
|
||||
|
||||
The main goal of these additional levels is to wait for completion of all a module related initialization routines for a certain level.
|
||||
|
||||
That's all.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
In this part we saw the important mechanism of the Linux kernel which allows to call a function which depends on the current state of the Linux kernel during its initialization.
|
||||
|
||||
If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
|
||||
|
||||
**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**.
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [callback](https://en.wikipedia.org/wiki/Callback_%28computer_programming%29)
|
||||
* [debugfs](https://en.wikipedia.org/wiki/Debugfs)
|
||||
* [integer type](https://en.wikipedia.org/wiki/Integer)
|
||||
* [symbols concatenation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html)
|
||||
* [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
|
||||
* [Link time optimization](https://gcc.gnu.org/wiki/LinkTimeOptimization)
|
||||
* [Introduction to linkers](https://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)
|
||||
* [Linux kernel command line](https://www.kernel.org/doc/Documentation/kernel-parameters.txt)
|
||||
* [Process identifier](https://en.wikipedia.org/wiki/Process_identifier)
|
||||
* [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
|
||||
* [rootfs](https://en.wikipedia.org/wiki/Initramfs)
|
||||
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
|
@ -1,9 +1,10 @@
|
||||
Data Structures in the Linux Kernel
|
||||
========================================================================
|
||||
|
||||
Linux kernel provides implementations of a different data structures like linked list, B+ tree, prinority heap and many many more.
|
||||
Linux kernel provides different implementations of data structures like doubly linked list, B+ tree, priority heap and many many more.
|
||||
|
||||
This part considers these data structures and algorithms.
|
||||
This part considers the following data structures and algorithms:
|
||||
|
||||
* [Doubly linked list](https://github.com/0xAX/linux-insides/blob/master/DataStructures/dlist.md)
|
||||
* [Radix tree](https://github.com/0xAX/linux-insides/blob/master/DataStructures/radix-tree.md)
|
||||
* [Bit arrays](https://github.com/0xAX/linux-insides/blob/master/DataStructures/bitmap.md)
|
||||
|
@ -0,0 +1,384 @@
|
||||
Data Structures in the Linux Kernel
|
||||
================================================================================
|
||||
|
||||
Bit arrays and bit operations in the Linux kernel
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Besides different [linked](https://en.wikipedia.org/wiki/Linked_data_structure) and [tree](https://en.wikipedia.org/wiki/Tree_%28data_structure%29) based data structures, the Linux kernel provides [API](https://en.wikipedia.org/wiki/Application_programming_interface) for [bit arrays](https://en.wikipedia.org/wiki/Bit_array) or `bitmap`. Bit arrays are heavily used in the Linux kernel and following source code files contain common `API` for work with such structures:
|
||||
|
||||
* [lib/bitmap.c](https://github.com/torvalds/linux/blob/master/lib/bitmap.c)
|
||||
* [include/linux/bitmap.h](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h)
|
||||
|
||||
Besides these two files, there is also architecture-specific header file which provides optimized bit operations for certain architecture. We consider [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, so in our case it will be:
|
||||
|
||||
* [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h)
|
||||
|
||||
header file. As I just wrote above, the `bitmap` is heavily used in the Linux kernel. For example a `bit array` is used to store set of online/offline processors for systems which support [hot-plug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) cpu (more about this you can read in the [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) part), a `bit array` stores set of allocated [irqs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) during initialization of the Linux kernel and etc.
|
||||
|
||||
So, the main goal of this part is to see how `bit arrays` are implemented in the Linux kernel. Let's start.
|
||||
|
||||
Declaration of bit array
|
||||
================================================================================
|
||||
|
||||
Before we will look on `API` for bitmaps manipulation, we must know how to declare it in the Linux kernel. There are two common method to declare own bit array. The first simple way to declare a bit array is to array of `unsigned long`. For example:
|
||||
|
||||
```C
|
||||
unsigned long my_bitmap[8]
|
||||
```
|
||||
|
||||
The second way is to use the `DECLARE_BITMAP` macro which is defined in the [include/linux/types.h](https://github.com/torvalds/linux/blob/master/include/linux/types.h) header file:
|
||||
|
||||
```C
|
||||
#define DECLARE_BITMAP(name,bits) \
|
||||
unsigned long name[BITS_TO_LONGS(bits)]
|
||||
```
|
||||
|
||||
We can see that `DECLARE_BITMAP` macro takes two parameters:
|
||||
|
||||
* `name` - name of bitmap;
|
||||
* `bits` - amount of bits in bitmap;
|
||||
|
||||
and just expands to the definition of `unsigned long` array with `BITS_TO_LONGS(bits)` elements, where the `BITS_TO_LONGS` macro converts a given number of bits to number of `longs` or in other words it calculates how many `8` byte elements in `bits`:
|
||||
|
||||
```C
|
||||
#define BITS_PER_BYTE 8
|
||||
#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
|
||||
#define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
|
||||
```
|
||||
|
||||
So, for example `DECLARE_BITMAP(my_bitmap, 64)` will produce:
|
||||
|
||||
```python
|
||||
>>> (((64) + (64) - 1) / (64))
|
||||
1
|
||||
```
|
||||
|
||||
and:
|
||||
|
||||
```C
|
||||
unsigned long my_bitmap[1];
|
||||
```
|
||||
|
||||
After we are able to declare a bit array, we can start to use it.
|
||||
|
||||
Architecture-specific bit operations
|
||||
================================================================================
|
||||
|
||||
We already saw above a couple of source code and header files which provide [API](https://en.wikipedia.org/wiki/Application_programming_interface) for manipulation of bit arrays. The most important and widely used API of bit arrays is architecture-specific and located as we already know in the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) header file.
|
||||
|
||||
First of all let's look at the two most important functions:
|
||||
|
||||
* `set_bit`;
|
||||
* `clear_bit`.
|
||||
|
||||
I think that there is no need to explain what these function do. This is already must be clear from their name. Let's look on their implementation. If you will look into the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) header file, you will note that each of these functions represented by two variants: [atomic](https://en.wikipedia.org/wiki/Linearizability) and not. Before we will start to dive into implementations of these functions, first of all we must to know a little about `atomic` operations.
|
||||
|
||||
In simple words atomic operations guarantees that two or more operations will not be performed on the same data concurrently. The `x86` architecture provides a set of atomic instructions, for example [xchg](http://x86.renejeschke.de/html/file_module_x86_id_328.html) instruction, [cmpxchg](http://x86.renejeschke.de/html/file_module_x86_id_41.html) instruction and etc. Besides atomic instructions, some of non-atomic instructions can be made atomic with the help of the [lock](http://x86.renejeschke.de/html/file_module_x86_id_159.html) instruction. It is enough to know about atomic operations for now, so we can begin to consider implementation of `set_bit` and `clear_bit` functions.
|
||||
|
||||
First of all, let's start to consider `non-atomic` variants of this function. Names of non-atomic `set_bit` and `clear_bit` starts from double underscore. As we already know, all of these functions are defined in the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) header file and the first function is `__set_bit`:
|
||||
|
||||
```C
|
||||
static inline void __set_bit(long nr, volatile unsigned long *addr)
|
||||
{
|
||||
asm volatile("bts %1,%0" : ADDR : "Ir" (nr) : "memory");
|
||||
}
|
||||
```
|
||||
|
||||
As we can see it takes two arguments:
|
||||
|
||||
* `nr` - number of bit in a bit array.
|
||||
* `addr` - address of a bit array where we need to set bit.
|
||||
|
||||
Note that the `addr` parameter is defined with `volatile` keyword which tells to compiler that value maybe changed by the given address. The implementation of the `__set_bit` is pretty easy. As we can see, it just contains one line of [inline assembler](https://en.wikipedia.org/wiki/Inline_assembler) code. In our case we are using the [bts](http://x86.renejeschke.de/html/file_module_x86_id_25.html) instruction which selects a bit which is specified with the first operand (`nr` in our case) from the bit array, stores the value of the selected bit in the [CF](https://en.wikipedia.org/wiki/FLAGS_register) flags register and set this bit.
|
||||
|
||||
Note that we can see usage of the `nr`, but there is `addr` here. You already might guess that the secret is in `ADDR`. The `ADDR` is the macro which is defined in the same header code file and expands to the string which contains value of the given address and `+m` constraint:
|
||||
|
||||
```C
|
||||
#define ADDR BITOP_ADDR(addr)
|
||||
#define BITOP_ADDR(x) "+m" (*(volatile long *) (x))
|
||||
```
|
||||
|
||||
Besides the `+m`, we can see other constraints in the `__set_bit` function. Let's look on they and try to understand what do they mean:
|
||||
|
||||
* `+m` - represents memory operand where `+` tells that the given operand will be input and output operand;
|
||||
* `I` - represents integer constant;
|
||||
* `r` - represents register operand
|
||||
|
||||
Besides these constraint, we also can see - the `memory` keyword which tells compiler that this code will change value in memory. That's all. Now let's look at the same function but at `atomic` variant. It looks more complex that its `non-atomic` variant:
|
||||
|
||||
```C
|
||||
static __always_inline void
|
||||
set_bit(long nr, volatile unsigned long *addr)
|
||||
{
|
||||
if (IS_IMMEDIATE(nr)) {
|
||||
asm volatile(LOCK_PREFIX "orb %1,%0"
|
||||
: CONST_MASK_ADDR(nr, addr)
|
||||
: "iq" ((u8)CONST_MASK(nr))
|
||||
: "memory");
|
||||
} else {
|
||||
asm volatile(LOCK_PREFIX "bts %1,%0"
|
||||
: BITOP_ADDR(addr) : "Ir" (nr) : "memory");
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
First of all note that this function takes the same set of parameters that `__set_bit`, but additionally marked with the `__always_inline` attribute. The `__always_inline` is macro which defined in the [include/linux/compiler-gcc.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler-gcc.h) and just expands to the `always_inline` attribute:
|
||||
|
||||
```C
|
||||
#define __always_inline inline __attribute__((always_inline))
|
||||
```
|
||||
|
||||
which means that this function will be always inlined to reduce size of the Linux kernel image. Now let's try to understand implementation of the `set_bit` function. First of all we check a given number of bit at the beginning of the `set_bit` function. The `IS_IMMEDIATE` macro defined in the same [header](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) file and expands to the call of the builtin [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) function:
|
||||
|
||||
```C
|
||||
#define IS_IMMEDIATE(nr) (__builtin_constant_p(nr))
|
||||
```
|
||||
|
||||
The `__builtin_constant_p` builtin function returns `1` if the given parameter is known to be constant at compile-time and returns `0` in other case. We no need to use slow `bts` instruction to set bit if the given number of bit is known in compile time constant. We can just apply [bitwise or](https://en.wikipedia.org/wiki/Bitwise_operation#OR) for byte from the give address which contains given bit and masked number of bits where high bit is `1` and other is zero. In other case if the given number of bit is not known constant at compile-time, we do the same as we did in the `__set_bit` function. The `CONST_MASK_ADDR` macro:
|
||||
|
||||
```C
|
||||
#define CONST_MASK_ADDR(nr, addr) BITOP_ADDR((void *)(addr) + ((nr)>>3))
|
||||
```
|
||||
|
||||
expands to the give address with offset to the byte which contains a given bit. For example we have address `0x1000` and the number of bit is `0x9`. So, as `0x9` is `one byte + one bit` our address with be `addr + 1`:
|
||||
|
||||
```python
|
||||
>>> hex(0x1000 + (0x9 >> 3))
|
||||
'0x1001'
|
||||
```
|
||||
|
||||
The `CONST_MASK` macro represents our given number of bit as byte where high bit is `1` and other bits are `0`:
|
||||
|
||||
```C
|
||||
#define CONST_MASK(nr) (1 << ((nr) & 7))
|
||||
```
|
||||
|
||||
```python
|
||||
>>> bin(1 << (0x9 & 7))
|
||||
'0b10'
|
||||
```
|
||||
|
||||
In the end we just apply bitwise `or` for these values. So, for example if our address will be `0x4097` and we need to set `0x9` bit:
|
||||
|
||||
```python
|
||||
>>> bin(0x4097)
|
||||
'0b100000010010111'
|
||||
>>> bin((0x4097 >> 0x9) | (1 << (0x9 & 7)))
|
||||
'0b100010'
|
||||
```
|
||||
|
||||
the `ninth` bit will be set.
|
||||
|
||||
Note that all of these operations are marked with `LOCK_PREFIX` which is expands to the [lock](http://x86.renejeschke.de/html/file_module_x86_id_159.html) instruction which guarantees atomicity of this operation.
|
||||
|
||||
As we already know, besides the `set_bit` and `__set_bit` operations, the Linux kernel provides two inverse functions to clear bit in atomic and non-atomic context. They are `clear_bit` and `__clear_bit`. Both of these functions are defined in the same [header file](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) and takes the same set of arguments. But not only arguments are similar. Generally these functions are very similar on the `set_bit` and `__set_bit`. Let's look on the implementation of the non-atomic `__clear_bit` function:
|
||||
|
||||
```C
|
||||
static inline void __clear_bit(long nr, volatile unsigned long *addr)
|
||||
{
|
||||
asm volatile("btr %1,%0" : ADDR : "Ir" (nr));
|
||||
}
|
||||
```
|
||||
|
||||
Yes. As we see, it takes the same set of arguments and contains very similar block of inline assembler. It just uses the [btr](http://x86.renejeschke.de/html/file_module_x86_id_24.html) instruction instead of `bts`. As we can understand form the function's name, it clears a given bit by the given address. The `btr` instruction acts like `bts`. This instruction also selects a given bit which is specified in the first operand, stores its value in the `CF` flag register and clears this bit in the given bit array which is specified with second operand.
|
||||
|
||||
The atomic variant of the `__clear_bit` is `clear_bit`:
|
||||
|
||||
```C
|
||||
static __always_inline void
|
||||
clear_bit(long nr, volatile unsigned long *addr)
|
||||
{
|
||||
if (IS_IMMEDIATE(nr)) {
|
||||
asm volatile(LOCK_PREFIX "andb %1,%0"
|
||||
: CONST_MASK_ADDR(nr, addr)
|
||||
: "iq" ((u8)~CONST_MASK(nr)));
|
||||
} else {
|
||||
asm volatile(LOCK_PREFIX "btr %1,%0"
|
||||
: BITOP_ADDR(addr)
|
||||
: "Ir" (nr));
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
and as we can see it is very similar on `set_bit` and just contains two differences. The first difference it uses `btr` instruction to clear bit when the `set_bit` uses `bts` instruction to set bit. The second difference it uses negated mask and `and` instruction to clear bit in the given byte when the `set_bit` uses `or` instruction.
|
||||
|
||||
That's all. Now we can set and clear bit in any bit array and and we can go to other operations on bitmasks.
|
||||
|
||||
Most widely used operations on a bit arrays are set and clear bit in a bit array in the Linux kernel. But besides this operations it is useful to do additional operations on a bit array. Yet another widely used operation in the Linux kernel - is to know is a given bit set or not in a bit array. We can achieve this with the help of the `test_bit` macro. This macro is defined in the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) header file and expands to the call of the `constant_test_bit` or `variable_test_bit` depends on bit number:
|
||||
|
||||
```C
|
||||
#define test_bit(nr, addr) \
|
||||
(__builtin_constant_p((nr)) \
|
||||
? constant_test_bit((nr), (addr)) \
|
||||
: variable_test_bit((nr), (addr)))
|
||||
```
|
||||
|
||||
So, if the `nr` is known in compile time constant, the `test_bit` will be expanded to the call of the `constant_test_bit` function or `variable_test_bit` in other case. Now let's look at implementations of these functions. Let's start from the `variable_test_bit`:
|
||||
|
||||
```C
|
||||
static inline int variable_test_bit(long nr, volatile const unsigned long *addr)
|
||||
{
|
||||
int oldbit;
|
||||
|
||||
asm volatile("bt %2,%1\n\t"
|
||||
"sbb %0,%0"
|
||||
: "=r" (oldbit)
|
||||
: "m" (*(unsigned long *)addr), "Ir" (nr));
|
||||
|
||||
return oldbit;
|
||||
}
|
||||
```
|
||||
|
||||
The `variable_test_bit` function takes similar set of arguments as `set_bit` and other function take. We also may see inline assembly code here which executes [bt](http://x86.renejeschke.de/html/file_module_x86_id_22.html) and [sbb](http://x86.renejeschke.de/html/file_module_x86_id_286.html) instruction. The `bt` or `bit test` instruction selects a given bit which is specified with first operand from the bit array which is specified with the second operand and stores its value in the [CF](https://en.wikipedia.org/wiki/FLAGS_register) bit of flags register. The second `sbb` instruction subtracts first operand from second and subtracts value of the `CF`. So, here write a value of a given bit number from a given bit array to the `CF` bit of flags register and execute `sbb` instruction which calculates: `00000000 - CF` and writes the result to the `oldbit`.
|
||||
|
||||
The `constant_test_bit` function does the same as we saw in the `set_bit`:
|
||||
|
||||
```C
|
||||
static __always_inline int constant_test_bit(long nr, const volatile unsigned long *addr)
|
||||
{
|
||||
return ((1UL << (nr & (BITS_PER_LONG-1))) &
|
||||
(addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
|
||||
}
|
||||
```
|
||||
|
||||
It generates a byte where high bit is `1` and other bits are `0` (as we saw in `CONST_MASK`) and applies bitwise [and](https://en.wikipedia.org/wiki/Bitwise_operation#AND) to the byte which contains a given bit number.
|
||||
|
||||
The next widely used bit array related operation is to change bit in a bit array. The Linux kernel provides two helper for this:
|
||||
|
||||
* `__change_bit`;
|
||||
* `change_bit`.
|
||||
|
||||
As you already can guess, these two variants are atomic and non-atomic as for example `set_bit` and `__set_bit`. For the start, let's look at the implementation of the `__change_bit` function:
|
||||
|
||||
```C
|
||||
static inline void __change_bit(long nr, volatile unsigned long *addr)
|
||||
{
|
||||
asm volatile("btc %1,%0" : ADDR : "Ir" (nr));
|
||||
}
|
||||
```
|
||||
|
||||
Pretty easy, is not it? The implementation of the `__change_bit` is the same as `__set_bit`, but instead of `bts` instruction, we are using [btc](http://x86.renejeschke.de/html/file_module_x86_id_23.html). This instruction selects a given bit from a given bit array, stores its value in the `CF` and changes its value by the applying of complement operation. So, a bit with value `1` will be `0` and vice versa:
|
||||
|
||||
```python
|
||||
>>> int(not 1)
|
||||
0
|
||||
>>> int(not 0)
|
||||
1
|
||||
```
|
||||
|
||||
The atomic version of the `__change_bit` is the `change_bit` function:
|
||||
|
||||
```C
|
||||
static inline void change_bit(long nr, volatile unsigned long *addr)
|
||||
{
|
||||
if (IS_IMMEDIATE(nr)) {
|
||||
asm volatile(LOCK_PREFIX "xorb %1,%0"
|
||||
: CONST_MASK_ADDR(nr, addr)
|
||||
: "iq" ((u8)CONST_MASK(nr)));
|
||||
} else {
|
||||
asm volatile(LOCK_PREFIX "btc %1,%0"
|
||||
: BITOP_ADDR(addr)
|
||||
: "Ir" (nr));
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
It is similar on `set_bit` function, but also has two differences. The first difference is `xor` operation instead of `or` and the second is `btc` instead of `bts`.
|
||||
|
||||
For this moment we know the most important architecture-specific operations with bit arrays. Time to look at generic bitmap API.
|
||||
|
||||
Common bit operations
|
||||
================================================================================
|
||||
|
||||
Besides the architecture-specific API from the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) header file, the Linux kernel provides common API for manipulation of bit arrays. As we know from the beginning of this part, we can find it in the [include/linux/bitmap.h](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) header file and additionally in the * [lib/bitmap.c](https://github.com/torvalds/linux/blob/master/lib/bitmap.c) source code file. But before these source code files let's look into the [include/linux/bitops.h](https://github.com/torvalds/linux/blob/master/include/linux/bitops.h) header file which provides a set of useful macro. Let's look on some of they.
|
||||
|
||||
First of all let's look at following four macros:
|
||||
|
||||
* `for_each_set_bit`
|
||||
* `for_each_set_bit_from`
|
||||
* `for_each_clear_bit`
|
||||
* `for_each_clear_bit_from`
|
||||
|
||||
All of these macros provide iterator over certain set of bits in a bit array. The first macro iterates over bits which are set, the second does the same, but starts from a certain bits. The last two macros do the same, but iterates over clear bits. Let's look on implementation of the `for_each_set_bit` macro:
|
||||
|
||||
```C
|
||||
#define for_each_set_bit(bit, addr, size) \
|
||||
for ((bit) = find_first_bit((addr), (size)); \
|
||||
(bit) < (size); \
|
||||
(bit) = find_next_bit((addr), (size), (bit) + 1))
|
||||
```
|
||||
|
||||
As we may see it takes three arguments and expands to the loop from first set bit which is returned as result of the `find_first_bit` function and to the last bit number while it is less than given size.
|
||||
|
||||
Besides these four macros, the [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) provides API for rotation of `64-bit` or `32-bit` values and etc.
|
||||
|
||||
The next [header](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) file which provides API for manipulation with a bit arrays. For example it provides two functions:
|
||||
|
||||
* `bitmap_zero`;
|
||||
* `bitmap_fill`.
|
||||
|
||||
To clear a bit array and fill it with `1`. Let's look on the implementation of the `bitmap_zero` function:
|
||||
|
||||
```C
|
||||
static inline void bitmap_zero(unsigned long *dst, unsigned int nbits)
|
||||
{
|
||||
if (small_const_nbits(nbits))
|
||||
*dst = 0UL;
|
||||
else {
|
||||
unsigned int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
|
||||
memset(dst, 0, len);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
First of all we can see the check for `nbits`. The `small_const_nbits` is macro which defined in the same header [file](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) and looks:
|
||||
|
||||
```C
|
||||
#define small_const_nbits(nbits) \
|
||||
(__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG)
|
||||
```
|
||||
|
||||
As we may see it checks that `nbits` is known constant in compile time and `nbits` value does not overflow `BITS_PER_LONG` or `64`. If bits number does not overflow amount of bits in a `long` value we can just set to zero. In other case we need to calculate how many `long` values do we need to fill our bit array and fill it with [memset](http://man7.org/linux/man-pages/man3/memset.3.html).
|
||||
|
||||
The implementation of the `bitmap_fill` function is similar on implementation of the `biramp_zero` function, except we fill a given bit array with `0xff` values or `0b11111111`:
|
||||
|
||||
```C
|
||||
static inline void bitmap_fill(unsigned long *dst, unsigned int nbits)
|
||||
{
|
||||
unsigned int nlongs = BITS_TO_LONGS(nbits);
|
||||
if (!small_const_nbits(nbits)) {
|
||||
unsigned int len = (nlongs - 1) * sizeof(unsigned long);
|
||||
memset(dst, 0xff, len);
|
||||
}
|
||||
dst[nlongs - 1] = BITMAP_LAST_WORD_MASK(nbits);
|
||||
}
|
||||
```
|
||||
|
||||
Besides the `bitmap_fill` and `bitmap_zero` functions, the [include/linux/bitmap.h](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) header file provides `bitmap_copy` which is similar on the `bitmap_zero`, but just uses [memcpy](http://man7.org/linux/man-pages/man3/memcpy.3.html) instead of [memset](http://man7.org/linux/man-pages/man3/memset.3.html). Also it provides bitwise operations for bit array like `bitmap_and`, `bitmap_or`, `bitamp_xor` and etc. We will not consider implementation of these functions because it is easy to understand implementations of these functions if you understood all from this part. Anyway if you are interested how did these function implemented, you may open [include/linux/bitmap.h](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) header file and start to research.
|
||||
|
||||
That's all.
|
||||
|
||||
Links
|
||||
================================================================================
|
||||
|
||||
* [bitmap](https://en.wikipedia.org/wiki/Bit_array)
|
||||
* [linked data structures](https://en.wikipedia.org/wiki/Linked_data_structure)
|
||||
* [tree data structures](https://en.wikipedia.org/wiki/Tree_%28data_structure%29)
|
||||
* [hot-plug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
|
||||
* [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
|
||||
* [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
|
||||
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
|
||||
* [atomic operations](https://en.wikipedia.org/wiki/Linearizability)
|
||||
* [xchg instruction](http://x86.renejeschke.de/html/file_module_x86_id_328.html)
|
||||
* [cmpxchg instruction](http://x86.renejeschke.de/html/file_module_x86_id_41.html)
|
||||
* [lock instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html)
|
||||
* [bts instruction](http://x86.renejeschke.de/html/file_module_x86_id_25.html)
|
||||
* [btr instruction](http://x86.renejeschke.de/html/file_module_x86_id_24.html)
|
||||
* [bt instruction](http://x86.renejeschke.de/html/file_module_x86_id_22.html)
|
||||
* [sbb instruction](http://x86.renejeschke.de/html/file_module_x86_id_286.html)
|
||||
* [btc instruction](http://x86.renejeschke.de/html/file_module_x86_id_23.html)
|
||||
* [man memcpy](http://man7.org/linux/man-pages/man3/memcpy.3.html)
|
||||
* [man memset](http://man7.org/linux/man-pages/man3/memset.3.html)
|
||||
* [CF](https://en.wikipedia.org/wiki/FLAGS_register)
|
||||
* [inline assembler](https://en.wikipedia.org/wiki/Inline_assembler)
|
||||
* [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
|
@ -0,0 +1,7 @@
|
||||
# Internal `system` structures of the Linux kernel
|
||||
|
||||
This is not usual chapter of `linux-insides`. As you may understand from the title, it mostly describes
|
||||
internal `system` structures of the Linux kernel. Like `Interrupt Descriptor Table`, `Global Descriptor
|
||||
Table` and many many more.
|
||||
|
||||
Most of information is taken from official [Intel](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html) and [AMD](http://developer.amd.com/resources/developer-guides-manuals/) manuals.
|
@ -0,0 +1,190 @@
|
||||
interrupt-descriptor table (IDT)
|
||||
================================================================================
|
||||
|
||||
Three general interrupt & exceptions sources:
|
||||
|
||||
* Exceptions - sync;
|
||||
* Software interrupts - sync;
|
||||
* External interrupts - async.
|
||||
|
||||
Types of Exceptions:
|
||||
|
||||
* Faults - are precise exceptions reported on the boundary `before` the instruction causing the exception. The saved `%rip` points to the faulting instruction;
|
||||
* Traps - are precise exceptions reported on the boundary `following` the instruction causing the exception. The same with `%rip`;
|
||||
* Aborts - are imprecise exceptions. Because they are imprecise, aborts typically do not allow reliable program restart.
|
||||
|
||||
`Maskable` interrupts trigger the interrupt-handling mechanism only when RFLAGS.IF=1. Otherwise they are held pending for as long as the RFLAGS.IF bit is cleared to 0.
|
||||
|
||||
`Nonmaskable` interrupts (NMI) are unaffected by the value of the rFLAGS.IF bit. However, the occurrence of an NMI masks further NMIs until an IRET instruction is executed.
|
||||
|
||||
Specific exception and interrupt sources are assigned a fixed vector-identification number (also called an “interrupt vector” or simply “vector”). The interrupt vector is used by the interrupt-handling mechanism to locate the system-software service routine assigned to the exception or interrupt. Up to
|
||||
256 unique interrupt vectors are available. The first 32 vectors are reserved for predefined exception and interrupt conditions. They are defined in the [arch/x86/include/asm/traps.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/traps.h#L121) header file:
|
||||
|
||||
```
|
||||
/* Interrupts/Exceptions */
|
||||
enum {
|
||||
X86_TRAP_DE = 0, /* 0, Divide-by-zero */
|
||||
X86_TRAP_DB, /* 1, Debug */
|
||||
X86_TRAP_NMI, /* 2, Non-maskable Interrupt */
|
||||
X86_TRAP_BP, /* 3, Breakpoint */
|
||||
X86_TRAP_OF, /* 4, Overflow */
|
||||
X86_TRAP_BR, /* 5, Bound Range Exceeded */
|
||||
X86_TRAP_UD, /* 6, Invalid Opcode */
|
||||
X86_TRAP_NM, /* 7, Device Not Available */
|
||||
X86_TRAP_DF, /* 8, Double Fault */
|
||||
X86_TRAP_OLD_MF, /* 9, Coprocessor Segment Overrun */
|
||||
X86_TRAP_TS, /* 10, Invalid TSS */
|
||||
X86_TRAP_NP, /* 11, Segment Not Present */
|
||||
X86_TRAP_SS, /* 12, Stack Segment Fault */
|
||||
X86_TRAP_GP, /* 13, General Protection Fault */
|
||||
X86_TRAP_PF, /* 14, Page Fault */
|
||||
X86_TRAP_SPURIOUS, /* 15, Spurious Interrupt */
|
||||
X86_TRAP_MF, /* 16, x87 Floating-Point Exception */
|
||||
X86_TRAP_AC, /* 17, Alignment Check */
|
||||
X86_TRAP_MC, /* 18, Machine Check */
|
||||
X86_TRAP_XF, /* 19, SIMD Floating-Point Exception */
|
||||
X86_TRAP_IRET = 32, /* 32, IRET Exception */
|
||||
};
|
||||
```
|
||||
|
||||
Error Codes
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The processor exception-handling mechanism reports error and status information for some exceptions using an error code. The error code is pushed onto the stack by the exception-mechanism during the control transfer into the exception handler. The error code has two formats:
|
||||
|
||||
* most error-reporting exceptions format;
|
||||
* page fault format.
|
||||
|
||||
Here is format of selector error code:
|
||||
|
||||
```
|
||||
31 16 15 3 2 1 0
|
||||
+-------------------------------------------------------------------------------+
|
||||
| | | T | I | E |
|
||||
| Reserved | Selector Index | - | D | X |
|
||||
| | | I | T | T |
|
||||
+-------------------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
Where:
|
||||
|
||||
* `EXT` - If this bit is set to 1, the exception source is external to the processor. If cleared to 0, the exception source is internal to the processor;
|
||||
* `IDT` - If this bit is set to 1, the error-code selector-index field references a gate descriptor located in the `interrupt-descriptor table`. If cleared to 0, the selector-index field references a descriptor in either the `global-descriptor table` or local-descriptor table `LDT`, as indicated by the `TI` bit;
|
||||
* `TI` - If this bit is set to 1, the error-code selector-index field references a descriptor in the `LDT`. If cleared to 0, the selector-index field references a descriptor in the `GDT`.
|
||||
* `Selector Index` - The selector-index field specifies the index into either the `GDT`, `LDT`, or `IDT`, as specified by the `IDT` and `TI` bits.
|
||||
|
||||
Page-Fault Error Code format is:
|
||||
|
||||
```
|
||||
31 4 3 2 1 0
|
||||
+-------------------------------------------------------------------------------+
|
||||
| | | R | U | R | - |
|
||||
| Reserved | I/D | S | - | - | P |
|
||||
| | | V | S | W | - |
|
||||
+-------------------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
Where:
|
||||
|
||||
* `I/D` - If this bit is set to 1, it indicates that the access that caused the page fault was an instruction fetch;
|
||||
* `RSV` - If this bit is set to 1, the page fault is a result of the processor reading a 1 from a reserved field within a page-translation-table entry;
|
||||
* `U/S` - If this bit is cleared to 0, an access in supervisor mode (`CPL=0, 1, or 2`) caused the page fault. If this bit is set to 1, an access in user mode (CPL=3) caused the page fault;
|
||||
* `R/W` - If this bit is cleared to 0, the access that caused the page fault is a memory read. If this bit is set to 1, the memory access that caused the page fault was a write;
|
||||
* `P` - If this bit is cleared to 0, the page fault was caused by a not-present page. If this bit is set to 1, the page fault was caused by a page-protection violation.

Interrupt Control Transfers
--------------------------------------------------------------------------------

The IDT may contain any of three kinds of gate descriptors:

* `Task Gate` - contains the segment selector of a TSS for an exception and/or interrupt handler task;
* `Interrupt Gate` - contains the segment selector and offset that the processor uses to transfer program execution to a handler procedure in an interrupt handler code segment;
* `Trap Gate` - contains the segment selector and offset that the processor uses to transfer program execution to a handler procedure in an exception handler code segment.

The general format of a gate is:

```
127                                                                            96
+-------------------------------------------------------------------------------+
|                                                                               |
|                                   Reserved                                    |
|                                                                               |
+-------------------------------------------------------------------------------+
95                                                                             64
+-------------------------------------------------------------------------------+
|                                                                               |
|                                 Offset 63..32                                 |
|                                                                               |
+-------------------------------------------------------------------------------+
63                               48 47  46   44   42         39          34    32
+-------------------------------------------------------------------------------+
|                                  |   | D |   |      |       |   |   |         |
|           Offset 31..16          | P | P | 0 | Type | 0 0 0 | 0 | 0 |   IST   |
|                                  |   | L |   |      |       |   |   |         |
+-------------------------------------------------------------------------------+
31                                 16 15                                        0
+-------------------------------------------------------------------------------+
|                                    |                                          |
|          Segment Selector          |               Offset 15..0               |
|                                    |                                          |
+-------------------------------------------------------------------------------+
```

Where:

* `Selector` - Segment Selector for the destination code segment;
* `Offset` - Offset to the handler procedure entry point;
* `DPL` - Descriptor Privilege Level;
* `P` - Segment Present flag;
* `IST` - Interrupt Stack Table;
* `TYPE` - one of: Local descriptor-table (LDT) segment descriptor, Task-state segment (TSS) descriptor, Call-gate descriptor, Interrupt-gate descriptor, Trap-gate descriptor or Task-gate descriptor.

An `IDT` descriptor is represented by the following structure in the Linux kernel (only for `x86_64`):

```C
struct gate_struct64 {
	u16 offset_low;
	u16 segment;
	unsigned ist : 3, zero0 : 5, type : 5, dpl : 2, p : 1;
	u16 offset_middle;
	u32 offset_high;
	u32 zero1;
} __attribute__((packed));
```

which is defined in the [arch/x86/include/asm/desc_defs.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/desc_defs.h#L51) header file.
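
To see how a 64-bit handler address maps onto the three `offset_*` fields, here is a small illustrative helper. This is my own sketch for demonstration, not a function from the kernel (it assumes the `struct gate_struct64` and the `u16`/`u32` typedefs shown above), although the kernel performs the equivalent splitting when it fills in a gate:

```C
/* Split a 64-bit handler address across the offset fields of the
 * struct gate_struct64 shown above. */
static void gate_set_offset(struct gate_struct64 *gate, unsigned long addr)
{
	gate->offset_low    = (u16)(addr & 0xffff);          /* bits 15..0  */
	gate->offset_middle = (u16)((addr >> 16) & 0xffff);  /* bits 31..16 */
	gate->offset_high   = (u32)(addr >> 32);             /* bits 63..32 */
}
```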

A task gate descriptor does not contain an `IST` field, and its format differs from that of interrupt/trap gates:

```C
struct ldttss_desc64 {
	u16 limit0;
	u16 base0;
	unsigned base1 : 8, type : 5, dpl : 2, p : 1;
	unsigned limit1 : 4, zero0 : 3, g : 1, base2 : 8;
	u32 base3;
	u32 zero1;
} __attribute__((packed));
```

Exceptions During a Task Switch
--------------------------------------------------------------------------------

An exception can occur during a task switch while loading a segment selector. Page faults can also occur when accessing a TSS. In these cases, the hardware task-switch mechanism completes loading the new task state from the TSS, and then triggers the appropriate exception mechanism.

**In long mode, an exception cannot occur during a task switch, because the hardware task-switch mechanism is disabled.**

Nonmaskable interrupt
--------------------------------------------------------------------------------

**TODO**

API
--------------------------------------------------------------------------------

**TODO**

Interrupt Stack Table
--------------------------------------------------------------------------------

**TODO**

Linux kernel development
================================================================================

Introduction
--------------------------------------------------------------------------------

As you may already know, last year I started a series of [blog posts](http://0xax.github.io/categories/assembly/) about assembly programming for the `x86_64` architecture. Before that I had never written a line of low-level code, except for a couple of toy `Hello World` examples at university, and that was a long time ago. Some time ago I became interested in such things: I understood that I could write programs, but I didn't actually understand how my programs were arranged.

After writing some assembly code I began to understand, **approximately**, how my program looks after compilation. But there were still many things I didn't understand. For example: what occurs when the `syscall` instruction is executed in my assembly code, what occurs when the `printf` function starts to work, or how my program can talk to other computers over the network. The [assembly](https://en.wikipedia.org/wiki/Assembly_language#Assembler) programming language didn't give me answers to these questions, so I decided to go deeper in my research. I started to learn from the source code of the Linux kernel and tried to understand the things that I was interested in. The source code of the Linux kernel didn't give me the answers to **all** of my questions, but now my knowledge of the Linux kernel and the processes around it is much better.

I'm writing this part nine and a half months after I started to learn from the source code of the Linux kernel and published the first [part](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-1.html) of this book. It now contains forty parts, and that is not the end. I decided to write this series about the Linux kernel mostly for myself. As you know, the Linux kernel is a huge piece of code and it is easy to forget what this or that part of it means and how it implements something. But soon the [linux-insides](https://github.com/0xAX/linux-insides) repo became popular, and after nine months it has `9096` stars:

![github](http://s2.postimg.org/jjb3s4frt/stars.png)

It seems that people are interested in the insides of the Linux kernel. Besides this, in all the time that I have been writing `linux-insides`, I have received many questions from different people about how to begin contributing to the Linux kernel. Generally people are interested in contributing to open source projects, and the Linux kernel is not an exception:

![google-linux](http://s4.postimg.org/yg9z5zx0d/google_linux.png)

So, it seems that people are interested in the Linux kernel development process. I thought it would be strange if a book about the Linux kernel did not contain a part describing how to take part in the Linux kernel development, and that's why I decided to write it. You will not find information in this part about why you should be interested in contributing to the Linux kernel. But if you are interested in how to start with Linux kernel development, this part is for you.

Let's start.

How to start with Linux kernel
---------------------------------------------------------------------------------

First of all, let's see how to get, build, and run the Linux kernel. You can run your custom build of the Linux kernel in two ways:

* Run the Linux kernel on a virtual machine;
* Run the Linux kernel on real hardware.

I'll provide descriptions for both methods. Before we start doing anything with the Linux kernel, we need to get it. There are a couple of ways to do this, depending on your purpose. If you just want to update the current version of the Linux kernel on your computer, you can use the instructions specific to your Linux [distro](https://en.wikipedia.org/wiki/Linux_distribution).

In the first case you just need to download a new version of the Linux kernel with the [package manager](https://en.wikipedia.org/wiki/Package_manager). For example, to upgrade the version of the Linux kernel to `4.1` on [Ubuntu (Vivid Vervet)](http://releases.ubuntu.com/15.04/), you just need to execute the following commands:

```
$ sudo add-apt-repository ppa:kernel-ppa/ppa
$ sudo apt-get update
```

After this, execute:

```
$ apt-cache showpkg linux-headers
```

and choose the version of the Linux kernel you are interested in. Finally, execute the next command, replacing `${version}` with the version that you chose in the output of the previous command:

```
$ sudo apt-get install linux-headers-${version} linux-headers-${version}-generic linux-image-${version}-generic --fix-missing
```

and reboot your system. After the reboot you will see the new kernel in the [grub](https://en.wikipedia.org/wiki/GNU_GRUB) menu.

On the other hand, if you are interested in Linux kernel development, you will need to get the source code of the Linux kernel. You can find it on the [kernel.org](https://kernel.org/) website and download an archive with the Linux kernel source code. Actually the Linux kernel development process is fully built around the `git` [version control system](https://en.wikipedia.org/wiki/Version_control), so you can get it with `git` from `kernel.org`:

```
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
```

I don't know about you, but I prefer `github`. There is a [mirror](https://github.com/torvalds/linux) of the Linux kernel mainline repository, so you can clone it with:

```
$ git clone git@github.com:torvalds/linux.git
```

I use my own [fork](https://github.com/0xAX/linux) for development, and when I want to pull updates from the main repository I just execute the following commands:

```
$ git checkout master
$ git pull upstream master
```

Note that the remote name of the main repository is `upstream`. To add a new remote with the main Linux repository you can execute:

```
$ git remote add upstream git@github.com:torvalds/linux.git
```

After this you will have two remotes:

```
~/dev/linux (master) $ git remote -v
origin    git@github.com:0xAX/linux.git (fetch)
origin    git@github.com:0xAX/linux.git (push)
upstream  https://github.com/torvalds/linux.git (fetch)
upstream  https://github.com/torvalds/linux.git (push)
```

One is your fork (`origin`) and the second is the main repository (`upstream`).

Now that we have a local copy of the Linux kernel source code, we need to configure and build it. The Linux kernel can be configured in different ways. The simplest way is to just copy the configuration file of the already installed kernel, which is located in the `/boot` directory:

```
$ sudo cp /boot/config-$(uname -r) ~/dev/linux/.config
```

If your current Linux kernel was built with support for access to the `/proc/config.gz` file, you can copy your actual kernel configuration file with this command:

```
$ cat /proc/config.gz | gunzip > ~/dev/linux/.config
```

If you are not satisfied with the standard kernel configuration that is provided by the maintainers of your distro, you can configure the Linux kernel manually. There are a couple of ways to do it. The Linux kernel root [Makefile](https://github.com/torvalds/linux/blob/master/Makefile) provides a set of targets that allow you to configure it. For example, `menuconfig` provides a menu-driven interface for the kernel configuration:

![menuconfig](http://s21.postimg.org/zcz48p7yf/menucnonfig.png)

The `defconfig` target generates the default kernel configuration file for the current architecture, for example [x86_64 defconfig](https://github.com/torvalds/linux/blob/master/arch/x86/configs/x86_64_defconfig). You can pass the `ARCH` command line argument to `make` to build `defconfig` for the given architecture:

```
$ make ARCH=arm64 defconfig
```

The `allnoconfig`, `allyesconfig` and `allmodconfig` targets allow you to generate a new configuration file where all options are disabled, enabled, or enabled as modules, respectively. The `nconfig` target provides an `ncurses`-based menu to configure the Linux kernel:

![nconfig](http://s29.postimg.org/hpghikp4n/nconfig.png)

There is even `randconfig`, which generates a random Linux kernel configuration file. I will not write about how to configure the Linux kernel or which options to enable, because it makes no sense to do so for two reasons: first of all, I do not know your hardware, and second, if you know your hardware, the only remaining task is to find out how to use the programs for kernel configuration, and all of them are pretty simple to use.
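
If you want to see the full list of configuration targets, you don't need to read the root `Makefile` - you can ask `make` itself:

```
$ make help | grep config
```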

OK, we now have the Linux kernel source code and have configured it. The next step is the compilation of the Linux kernel. The simplest way to compile the Linux kernel is to just execute:

```
$ make
scripts/kconfig/conf  --silentoldconfig Kconfig
#
# configuration written to .config
#
  CHK     include/config/kernel.release
  UPD     include/config/kernel.release
  CHK     include/generated/uapi/linux/version.h
  CHK     include/generated/utsrelease.h
...
...
...
  OBJCOPY arch/x86/boot/vmlinux.bin
  AS      arch/x86/boot/header.o
  LD      arch/x86/boot/setup.elf
  OBJCOPY arch/x86/boot/setup.bin
  BUILD   arch/x86/boot/bzImage
Setup is 15740 bytes (padded to 15872 bytes).
System is 4342 kB
CRC 82703414
Kernel: arch/x86/boot/bzImage is ready  (#73)
```

To increase the speed of kernel compilation you can pass the `-jN` command line argument to `make`, where `N` specifies the number of commands to run simultaneously:

```
$ make -j8
```
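
A common convenience (just a shell idiom, not something from the kernel documentation) is to let `nproc` pick the number of jobs based on your CPU count:

```
$ make -j$(nproc)
```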

If you want to build the Linux kernel for an architecture that differs from your current one, the simplest way is to pass two arguments:

* the `ARCH` command line argument with the name of the target architecture;
* the `CROSS_COMPILE` command line argument with the cross-compiler tool prefix.

For example, if we want to compile the Linux kernel for [arm64](https://en.wikipedia.org/wiki/ARM_architecture#AArch64_features) with the default kernel configuration file, we need to execute the following commands:

```
$ make -j4 ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- defconfig
$ make -j4 ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
```

As a result of the compilation (of the `x86_64` build above) we get the compressed kernel - `arch/x86/boot/bzImage`. Now that we have compiled the kernel, we can either install it on our computer or just run it in an emulator.

Installing Linux kernel
--------------------------------------------------------------------------------

As I already wrote, we will consider two ways to launch the new kernel: installing and running the new version of the Linux kernel on real hardware, and launching the Linux kernel on a virtual machine. In the previous paragraph we saw how to build the Linux kernel from source code, and as a result we got a compressed image:

```
...
...
...
Kernel: arch/x86/boot/bzImage is ready  (#73)
```

After we have got the [bzImage](https://en.wikipedia.org/wiki/Vmlinux#bzImage), we need to install the `headers` and `modules` of the new Linux kernel with:

```
$ sudo make headers_install
$ sudo make modules_install
```

and the kernel itself:

```
$ sudo make install
```

From this moment we have installed the new version of the Linux kernel, and now we must tell the `bootloader` about it. Of course we can add it manually by editing the `/boot/grub2/grub.cfg` configuration file, but I prefer to use a script for this purpose. I use two different Linux distros, Fedora and Ubuntu, and there are two different ways to update the [grub](https://en.wikipedia.org/wiki/GNU_GRUB) configuration file, so I'm using the following script:

```shell
#!/bin/bash

# Color definitions live in the author's "term-colors" file.
source "term-colors"

DISTRIBUTIVE=$(cat /etc/*-release | grep NAME | head -1 | sed -n -e 's/NAME\=//p')
echo -e "Distributive: ${Green}${DISTRIBUTIVE}${Color_Off}"

if [[ "$DISTRIBUTIVE" == "Fedora" ]] ;
then
    su -c 'grub2-mkconfig -o /boot/grub2/grub.cfg'
else
    sudo update-grub
fi

echo -e "${Green}Done.${Color_Off}"
```

This is the last step of the new Linux kernel installation; after this you can reboot your computer and select the new version of the kernel during boot.

The second case is to launch the new Linux kernel in a virtual machine. I prefer [qemu](https://en.wikipedia.org/wiki/QEMU). First of all we need to build an initial ramdisk - [initrd](https://en.wikipedia.org/wiki/Initrd) - for this. The `initrd` is a temporary root file system that is used by the Linux kernel during the initialization process, while other filesystems are not mounted. Let's build an `initrd` with the following commands.

First of all we need to download [busybox](https://en.wikipedia.org/wiki/BusyBox) and run `menuconfig` for its configuration:

```shell
$ mkdir initrd
$ cd initrd
$ curl http://busybox.net/downloads/busybox-1.23.2.tar.bz2 | tar xjf -
$ cd busybox-1.23.2/
$ make menuconfig
```

`busybox` is a single executable - `/bin/busybox` - that contains a set of standard tools like [coreutils](https://en.wikipedia.org/wiki/GNU_Core_Utilities). In the `busybox` menu we need to enable the `Build BusyBox as a static binary (no shared libs)` option:

![busybox menu](http://s18.postimg.org/sj92uoweh/busybox.png)

We can find this menu under:

```
Busybox Settings
--> Build Options
```

After this we exit from the `busybox` configuration menu and execute the following commands to build and install it:

```
$ make -j4
$ sudo make install
```

Now that `busybox` is installed, we can begin building our `initrd`. To do this, we go back to the previous `initrd` directory and:

```
$ cd ..
$ mkdir -p initramfs
$ cd initramfs
$ mkdir -pv {bin,sbin,etc,proc,sys,usr/{bin,sbin}}
$ cp -av ../busybox-1.23.2/_install/* .
```

This copies the `busybox` binaries into the `bin`, `sbin` and other directories. Now we need to create an executable `init` file that will be executed as the first process in the system. My `init` file just mounts the [procfs](https://en.wikipedia.org/wiki/Procfs) and [sysfs](https://en.wikipedia.org/wiki/Sysfs) filesystems and executes a shell:

```shell
#!/bin/sh

mount -t proc none /proc
mount -t sysfs none /sys

exec /bin/sh
```
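
One step that is easy to forget: the `init` file must be executable, otherwise the kernel will not be able to run it as the first process:

```
$ chmod +x init
```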

Now we can create the archive that will be our `initrd`:

```
$ find . -print0 | cpio --null -ov --format=newc | gzip -9 > ~/dev/initrd_x86_64.gz
```

We can now run our kernel in the virtual machine. As I already wrote, I prefer [qemu](https://en.wikipedia.org/wiki/QEMU) for this. We can run our kernel with the following command:

```
$ qemu-system-x86_64 -snapshot -m 8G -serial stdio -kernel ~/dev/linux/arch/x86_64/boot/bzImage -initrd ~/dev/initrd_x86_64.gz -append "root=/dev/sda1 ignore_loglevel"
```

![qemu](http://s22.postimg.org/b8ttyigup/qemu.png)

From now on we can run the Linux kernel in the virtual machine, which means that we can begin to change and test the kernel.
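
If no kernel output shows up in your terminal, it may help to point the kernel console at the serial port that `-serial stdio` exposes (whether you need this depends on your setup):

```
$ qemu-system-x86_64 -snapshot -m 8G -serial stdio \
    -kernel ~/dev/linux/arch/x86_64/boot/bzImage \
    -initrd ~/dev/initrd_x86_64.gz \
    -append "console=ttyS0 ignore_loglevel"
```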

Consider using [ivandavidov/minimal](https://github.com/ivandavidov/minimal) to automate the process of generating an initrd.

Getting started with the Linux Kernel Development
---------------------------------------------------------------------------------

The main point of this paragraph is to answer two questions: what to do and what not to do before sending your first patch to the Linux kernel. Please, do not confuse this `to do` with `todo`. I have no answer to what you can fix in the Linux kernel; I just want to tell you about my workflow for experimenting with the Linux kernel source code.

First of all I pull the latest updates from Linus's repo with the following commands:

```
$ git checkout master
$ git pull upstream master
```

After this my local repository with the Linux kernel source code is synced with the [mainline](https://github.com/torvalds/linux) repository. Now we can make some changes in the source code. As I already wrote, I have no advice for you about where to start and what `TODO` in the Linux kernel. But the best place for newbies is the `staging` tree, in other words the set of drivers from [drivers/staging](https://github.com/torvalds/linux/tree/master/drivers/staging). The maintainer of the `staging` tree is [Greg Kroah-Hartman](https://en.wikipedia.org/wiki/Greg_Kroah-Hartman), and the `staging` tree is the place where your trivial patch can be accepted. Let's look at a simple example that describes how to generate a patch, check it, and send it to the [Linux kernel mailing list](https://lkml.org/).

If we look in the driver for [Digi International EPCA PCI](https://github.com/torvalds/linux/tree/master/drivers/staging/dgap) based devices, we will see the `dgap_sindex` function on line 295:

```C
static char *dgap_sindex(char *string, char *group)
{
	char *ptr;

	if (!string || !group)
		return NULL;

	for (; *string; string++) {
		for (ptr = group; *ptr; ptr++) {
			if (*ptr == *string)
				return string;
		}
	}

	return NULL;
}
```

This function looks for a match of any character in the group and returns that position. During my research of the Linux kernel source code, I noticed that the [lib/string.c](https://github.com/torvalds/linux/blob/master/lib/string.c#L473) source code file contains an implementation of the `strpbrk` function that does the same thing as `dgap_sindex`. It is not a good idea to use a custom implementation of a function that already exists, so we can remove the `dgap_sindex` function from the [drivers/staging/dgap/dgap.c](https://github.com/torvalds/linux/blob/master/drivers/staging/dgap/dgap.c) source code file and use `strpbrk` instead.
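
If you want to convince yourself that the two functions behave the same before touching the driver, a quick userspace check helps. This little test program is mine, not part of the patch:

```C
#include <stdio.h>
#include <string.h>

/* The driver's custom helper, reproduced here for comparison. */
static char *dgap_sindex(char *string, char *group)
{
	char *ptr;

	if (!string || !group)
		return NULL;

	for (; *string; string++)
		for (ptr = group; *ptr; ptr++)
			if (*ptr == *string)
				return string;

	return NULL;
}

int main(void)
{
	char buf[] = "dgap;staging,driver";

	/* Both calls return a pointer to the first ';' or ',' in buf. */
	printf("%s\n", dgap_sindex(buf, ";,"));
	printf("%s\n", strpbrk(buf, ";,"));

	return 0;
}
```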

First of all let's create a new `git` branch based on the current master, which is synced with the Linux kernel mainline repo:

```
$ git checkout -b "dgap-remove-dgap_sindex"
```

And now we can replace `dgap_sindex` with `strpbrk`. After we have made all the changes, we need to recompile the Linux kernel, or just the [dgap](https://github.com/torvalds/linux/tree/master/drivers/staging/dgap) directory. Do not forget to enable this driver in the kernel configuration. You can find it under:

```
Device Drivers
--> Staging drivers
----> Digi EPCA PCI products
```

![dgap menu](http://s4.postimg.org/d3pozpge5/digi.png)

Now it is time to make a commit. I'm using the following combination for this:

```
$ git add .
$ git commit -s -v
```

After the last command, an editor will be opened, chosen from the `$GIT_EDITOR` or `$EDITOR` environment variable. The `-s` command line argument adds a `Signed-off-by` line by the committer at the end of the commit log message. You can find this line at the end of each commit message, for example - [00cc1633](https://github.com/torvalds/linux/commit/00cc1633816de8c95f337608a1ea64e228faf771). The main point of this line is to track who made a change. The `-v` option shows, at the bottom of the commit message, a unified diff between the `HEAD` commit and what would be committed. It is not necessary, but sometimes very useful. A couple of words about commit messages: actually a commit message consists of two parts.

The first part is on the first line and contains a short description of the changes. It starts with the `[PATCH]` prefix, followed by a subsystem, driver or architecture name, then a `:` symbol, and then the short description. In our case it will be something like this:

```
[PATCH] staging/dgap: Use strpbrk() instead of dgap_sindex()
```

After the short description, usually we have an empty line and then the full description of the commit. In our case it will be:

```
The <linux/string.h> provides strpbrk() function that does the same that the
dgap_sindex(). Let's use already defined function instead of writing custom.
```

And the `Signed-off-by` line at the end of the commit message. Note that each line of a commit message must not be longer than `80` symbols, and the commit message must describe your changes in detail. Do not just write a commit message like `Custom function removed`; you need to describe what you did and why. The patch reviewers must know what they are reviewing. Besides this, such commit messages are very helpful: each time we can't understand something, we can use [git blame](http://git-scm.com/docs/git-blame) to read the description of the changes.

After we have committed the changes, it is time to generate a patch. We can do it with the `format-patch` command:

```
$ git format-patch master
0001-staging-dgap-Use-strpbrk-instead-of-dgap_sindex.patch
```

We passed the name of the branch (`master` in this case) to the `format-patch` command, which will generate a patch with the last changes that are in the `dgap-remove-dgap_sindex` branch but not in the `master` branch. As you can see, the `format-patch` command generates a file that contains the last changes and has a name based on the commit's short description. If you want to generate a patch with a custom name, you can use the `--stdout` option:

```
$ git format-patch master --stdout > dgap-patch-1.patch
```

The last step after we have generated our patch is to send it to the Linux kernel mailing list. Of course, you can use any email client; however, `git` provides a special command for this: `git send-email`. Before you send your patch, you need to know where to send it. Yes, you can just send it to the Linux kernel mailing list address, `linux-kernel@vger.kernel.org`, but it is very likely that the patch will be ignored there because of the large flow of messages. The better choice is to send the patch to the maintainers of the subsystem where you made the changes. To find the names of these maintainers, use the `get_maintainer.pl` script. All you need to do is pass the file or directory where you wrote the code:

```
$ ./scripts/get_maintainer.pl -f drivers/staging/dgap/dgap.c
Lidza Louina <lidza.louina@gmail.com> (maintainer:DIGI EPCA PCI PRODUCTS)
Mark Hounschell <markh@compro.net> (maintainer:DIGI EPCA PCI PRODUCTS)
Daeseok Youn <daeseok.youn@gmail.com> (maintainer:DIGI EPCA PCI PRODUCTS)
Greg Kroah-Hartman <gregkh@linuxfoundation.org> (supporter:STAGING SUBSYSTEM)
driverdev-devel@linuxdriverproject.org (open list:DIGI EPCA PCI PRODUCTS)
devel@driverdev.osuosl.org (open list:STAGING SUBSYSTEM)
linux-kernel@vger.kernel.org (open list)
```

You will see a set of names and related emails. Now we can send our patch (note that the patch file itself goes at the end of the command):

```
$ git send-email --to "Lidza Louina <lidza.louina@gmail.com>" \
    --cc "Mark Hounschell <markh@compro.net>" \
    --cc "Daeseok Youn <daeseok.youn@gmail.com>" \
    --cc "Greg Kroah-Hartman <gregkh@linuxfoundation.org>" \
    --cc "driverdev-devel@linuxdriverproject.org" \
    --cc "devel@driverdev.osuosl.org" \
    --cc "linux-kernel@vger.kernel.org" \
    0001-staging-dgap-Use-strpbrk-instead-of-dgap_sindex.patch
```

The patch is sent; now you only have to wait for feedback from the Linux kernel developers. After you send a patch and a maintainer accepts it, you will find it in the maintainer's repository (for example the [patch](https://git.kernel.org/cgit/linux/kernel/git/gregkh/staging.git/commit/?h=staging-testing&id=b9f7f1d0846f15585b8af64435b6b706b25a5c0b) that you saw in this part), and after some time the maintainer will send a pull request to Linus and you will see your patch in the mainline repository.

That's all.

Some advice
--------------------------------------------------------------------------------

At the end of this part I want to give you some advice about what to do and what not to do during development of the Linux kernel:

* Think, Think, Think. And think again before you decide to send a patch.

* Each time you change something in the Linux kernel source code - compile it. After any change. Again and again. Nobody likes changes that don't even compile.

* The Linux kernel has a coding style [guide](https://github.com/torvalds/linux/blob/master/Documentation/CodingStyle) and you need to comply with it. There is a great script which can help you check your changes: [scripts/checkpatch.pl](https://github.com/torvalds/linux/blob/master/scripts/checkpatch.pl). Just pass a source code file with your changes to it and you will see:

```
$ ./scripts/checkpatch.pl -f drivers/staging/dgap/dgap.c
WARNING: Block comments use * on subsequent lines
#94: FILE: drivers/staging/dgap/dgap.c:94:
+/*
+     SUPPORTED PRODUCTS

CHECK: spaces preferred around that '|' (ctx:VxV)
#143: FILE: drivers/staging/dgap/dgap.c:143:
+	{ PPCM, PCI_DEV_XEM_NAME, 64, (T_PCXM|T_PCLITE|T_PCIBUS) },
```

You can also see the problematic places with the help of `git diff`:

![git diff](http://oi60.tinypic.com/2u91rgn.jpg)

* [Linus doesn't accept github pull requests](https://github.com/torvalds/linux/pull/17#issuecomment-5654674)

* If your change consists of several different and unrelated changes, you need to split them into separate commits. The `git format-patch` command will generate a patch for each commit, and the subject of each patch will contain an `n/m` counter, where `n` is the number of the patch in the series. If you are planning to send a series of patches, it will be helpful to pass the `--cover-letter` option to the `git format-patch` command. This will generate an additional file containing a cover letter that you can use to describe what your patchset changes. It is also a good idea to use the `--in-reply-to` option in the `git send-email` command. This option allows you to send your patch series in reply to your cover message; see the sketch after this list. For a maintainer, the structure of your series will look like this:

```
|--> cover letter
  |----> patch_1
  |----> patch_2
```
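
Putting these options together, a hypothetical workflow for sending a two-patch `v2` series with a cover letter might look like this (the address and the `message-id` are placeholders that you would take from your own mailing list thread):

```
$ git format-patch -v2 --cover-letter master
$ git send-email --to "maintainer@example.com" \
    --in-reply-to="<original-thread-message-id@example.com>" \
    v2-*.patch
```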

You need to pass a `message-id` as an argument to the `--in-reply-to` option; you can find it in the output of the `git send-email` command.

It's important that your email is in the [plain text](https://en.wikipedia.org/wiki/Plain_text) format. Generally, `send-email` and `format-patch` are very useful during development, so look at the documentation for these commands and you'll find some useful options: [git send-email](http://git-scm.com/docs/git-send-email) and [git format-patch](http://git-scm.com/docs/git-format-patch).

* Do not be surprised if you do not get an immediate answer after you send your patch. Maintainers can be very busy.

* The [scripts](https://github.com/torvalds/linux/tree/master/scripts) directory contains many different useful scripts that are related to Linux kernel development. We already saw two scripts from this directory: the `checkpatch.pl` and the `get_maintainer.pl` scripts. Besides those, you can find the [stackusage](https://github.com/torvalds/linux/blob/master/scripts/stackusage) script that prints the stack usage, [extract-vmlinux](https://github.com/torvalds/linux/blob/master/scripts/extract-vmlinux) for extracting an uncompressed kernel image, and many others. Outside of the `scripts` directory you can find some very useful [scripts](https://github.com/lorenzo-stoakes/kernel-scripts) by [Lorenzo Stoakes](https://twitter.com/ljsloz) for kernel development.

* Subscribe to the Linux kernel mailing list. There are a large number of letters every day on `lkml`, but it is very useful to read them and understand things such as the current state of the Linux kernel. Besides `lkml` there is a [set](http://vger.kernel.org/vger-lists.html) of mailing lists which are related to the different Linux kernel subsystems.

* If your patch is not accepted the first time and you receive feedback from Linux kernel developers, make your changes and resend the patch with the `[PATCH vN]` prefix (where `N` is the number of the patch version). For example:

```
[PATCH v2] staging/dgap: Use strpbrk() instead of dgap_sindex()
```

It must also contain a changelog that describes all changes from previous patch versions. Of course, this is not an exhaustive list of requirements for Linux kernel development, but some of the most important items were addressed.

Happy Hacking!

Conclusion
--------------------------------------------------------------------------------

I hope this will help others join the Linux kernel community!
If you have any questions or suggestions, write me an [email](mailto:kuleshovmail@gmail.com) or ping [me](https://twitter.com/0xAX) on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please let me know via email or send a PR.

Links
--------------------------------------------------------------------------------

* [blog posts about assembly programming for x86_64](http://0xax.github.io/categories/assembly/)
* [Assembler](https://en.wikipedia.org/wiki/Assembly_language#Assembler)
* [distro](https://en.wikipedia.org/wiki/Linux_distribution)
* [package manager](https://en.wikipedia.org/wiki/Package_manager)
* [grub](https://en.wikipedia.org/wiki/GNU_GRUB)
* [kernel.org](https://kernel.org/)
* [version control system](https://en.wikipedia.org/wiki/Version_control)
* [arm64](https://en.wikipedia.org/wiki/ARM_architecture#AArch64_features)
* [bzImage](https://en.wikipedia.org/wiki/Vmlinux#bzImage)
* [qemu](https://en.wikipedia.org/wiki/QEMU)
* [initrd](https://en.wikipedia.org/wiki/Initrd)
* [busybox](https://en.wikipedia.org/wiki/BusyBox)
* [coreutils](https://en.wikipedia.org/wiki/GNU_Core_Utilities)
* [procfs](https://en.wikipedia.org/wiki/Procfs)
* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
* [Linux kernel mailing list archive](https://lkml.org/)
* [Linux kernel coding style guide](https://github.com/torvalds/linux/blob/master/Documentation/CodingStyle)
* [How to Get Your Change Into the Linux Kernel](https://github.com/torvalds/linux/blob/master/Documentation/SubmittingPatches)
* [Linux Kernel Newbies](http://kernelnewbies.org/)
* [plain text](https://en.wikipedia.org/wiki/Plain_text)

Program startup process in userspace
================================================================================

Introduction
--------------------------------------------------------------------------------

Even though [linux-insides](https://www.gitbook.com/book/0xax/linux-insides/details) describes mostly Linux kernel related stuff, I have decided to write this part, which is mostly related to userspace.

There is already a fourth [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html) of the [system calls](https://en.wikipedia.org/wiki/System_call) chapter which describes what the Linux kernel does when we want to start a program. In this part I want to explore what happens when we run a program on a Linux machine, from the userspace perspective.

I don't know about you, but I was taught at university that a `C` program starts executing from the function called `main`. And that's partly true. Every time we start writing a new program, we begin with the following lines of code:

```C
int main(int argc, char *argv[]) {
	// Entry point is here
}
```

But if you are interested in low-level programming, you may already know that the `main` function isn't the actual entry point of a program. We can make sure of this by looking at this simple program in a debugger:

```C
int main(int argc, char *argv[]) {
	return 0;
}
```

Let's compile it and run it in [gdb](https://www.gnu.org/software/gdb/):

```
$ gcc -ggdb program.c -o program
$ gdb ./program
The target architecture is assumed to be i386:x86-64:intel
Reading symbols from ./program...done.
```

Let's execute the gdb `info` subcommand with the `files` argument. The `info files` command prints information about debugging targets and the memory spaces occupied by different sections:

```
(gdb) info files
Symbols from "/home/alex/program".
Local exec file:
    `/home/alex/program', file type elf64-x86-64.
    Entry point: 0x400430
    0x0000000000400238 - 0x0000000000400254 is .interp
    0x0000000000400254 - 0x0000000000400274 is .note.ABI-tag
    0x0000000000400274 - 0x0000000000400298 is .note.gnu.build-id
    0x0000000000400298 - 0x00000000004002b4 is .gnu.hash
    0x00000000004002b8 - 0x0000000000400318 is .dynsym
    0x0000000000400318 - 0x0000000000400357 is .dynstr
    0x0000000000400358 - 0x0000000000400360 is .gnu.version
    0x0000000000400360 - 0x0000000000400380 is .gnu.version_r
    0x0000000000400380 - 0x0000000000400398 is .rela.dyn
    0x0000000000400398 - 0x00000000004003c8 is .rela.plt
    0x00000000004003c8 - 0x00000000004003e2 is .init
    0x00000000004003f0 - 0x0000000000400420 is .plt
    0x0000000000400420 - 0x0000000000400428 is .plt.got
    0x0000000000400430 - 0x00000000004005e2 is .text
    0x00000000004005e4 - 0x00000000004005ed is .fini
    0x00000000004005f0 - 0x0000000000400610 is .rodata
    0x0000000000400610 - 0x0000000000400644 is .eh_frame_hdr
    0x0000000000400648 - 0x000000000040073c is .eh_frame
    0x0000000000600e10 - 0x0000000000600e18 is .init_array
    0x0000000000600e18 - 0x0000000000600e20 is .fini_array
    0x0000000000600e20 - 0x0000000000600e28 is .jcr
    0x0000000000600e28 - 0x0000000000600ff8 is .dynamic
    0x0000000000600ff8 - 0x0000000000601000 is .got
    0x0000000000601000 - 0x0000000000601028 is .got.plt
    0x0000000000601028 - 0x0000000000601034 is .data
    0x0000000000601034 - 0x0000000000601038 is .bss
```

Note the `Entry point: 0x400430` line. Now we know the actual address of the entry point of our program. Let's put a breakpoint at this address, run our program, and see what happens:

```
(gdb) break *0x400430
Breakpoint 1 at 0x400430
(gdb) run
Starting program: /home/alex/program

Breakpoint 1, 0x0000000000400430 in _start ()
```

Interesting. We don't see the execution of the `main` function here; instead we see that another function is called. This function is `_start`, and as the debugger shows us, it is the actual entry point of our program. Where does this function come from? Who calls `main`, and when is it called? I will try to answer these questions in this post.

How the kernel starts a new program
--------------------------------------------------------------------------------

First of all, let's take the following simple `C` program:

```C
// program.c

#include <stdlib.h>
#include <stdio.h>

static int x = 1;

int y = 2;

int main(int argc, char *argv[]) {
	int z = 3;

	printf("x + y + z = %d\n", x + y + z);

	return EXIT_SUCCESS;
}
```

We can make sure that this program works as we expect. Let's compile it:

```
$ gcc -Wall program.c -o sum
```

and run it:

```
$ ./sum
x + y + z = 6
```

OK, everything looks pretty good so far. You may already know that there is a special family of [system calls](https://en.wikipedia.org/wiki/System_call) - the [exec*](http://man7.org/linux/man-pages/man3/execl.3.html) system calls. As we may read in the man page:

> The exec() family of functions replaces the current process image with a new process image.
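
As a small illustration (a minimal sketch of my own, not from the man page), here is how a process can replace itself with `/bin/echo` via `execve`:

```C
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char *argv[] = { "/bin/echo", "hello from the new process image", NULL };
	char *envp[] = { NULL };

	/* On success execve() does not return: the current process image
	 * is replaced, so the lines below run only on failure. */
	execve("/bin/echo", argv, envp);
	perror("execve");

	return 1;
}
```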

If you have read the fourth [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html) of the chapter which describes [system calls](https://en.wikipedia.org/wiki/System_call), you may know that, for example, the [execve](http://linux.die.net/man/2/execve) system call is defined in the [fs/exec.c](https://github.com/torvalds/linux/blob/master/fs/exec.c#L1859) source code file and looks like:

```C
SYSCALL_DEFINE3(execve,
		const char __user *, filename,
		const char __user *const __user *, argv,
		const char __user *const __user *, envp)
{
	return do_execve(getname(filename), argv, envp);
}
```

It takes an executable file name, a set of command line arguments, and a set of environment variables. As you may guess, everything is done by the `do_execve` function. I will not describe the implementation of the `do_execve` function in detail because you can read about it [here](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html). In short, the `do_execve` function does many checks, for example that `filename` is valid and that the limit of launched processes is not exceeded in our system. After all of these checks, this function parses our executable file, which is represented in the [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) format, creates a memory descriptor for the newly executed executable file, and fills it with the appropriate values like the areas for the stack, heap and so on. When the setup of the new binary image is done, the `start_thread` function performs the setup of the new process. This function is architecture-specific, and for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture its definition is located in the [arch/x86/kernel/process_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/process_64.c#L231) source code file.

The `start_thread` function sets new values for the [segment registers](https://en.wikipedia.org/wiki/X86_memory_segmentation) and the program execution address. From this point our new process is ready to start. Once the [context switch](https://en.wikipedia.org/wiki/Context_switch) is done, control returns to userspace with the new values of the registers, and the new executable starts to execute.

That's all from the kernel side. The Linux kernel prepares the binary image for execution, and its execution starts right after the context switch returns control to userspace. But that does not answer questions like where `_start` comes from. Let's try to answer them in the next paragraph.

How does a program start in userspace
--------------------------------------------------------------------------------

In the previous paragraph we saw how an executable file is prepared for running by the Linux kernel. Let's look at the same process, but from the userspace side. We already know that the entry point of each program is its `_start` function. But where does this function come from? It may come from a library. But if you remember correctly, we didn't link our program with any libraries during its compilation:

```
$ gcc -Wall program.c -o sum
```

You may guess that `_start` comes from the [standard library](https://en.wikipedia.org/wiki/Standard_library), and that's true. If we try to compile our program again and pass the `-v` option to gcc, which enables `verbose mode`, we will see a long output. The full output is not interesting for us; let's look at the following steps. First of all, our program is compiled with `cc1`:

```
$ gcc -v -ggdb program.c -o sum
...
...
...
/usr/libexec/gcc/x86_64-redhat-linux/6.1.1/cc1 -quiet -v program.c -quiet -dumpbase program.c -mtune=generic -march=x86-64 -auxbase test -ggdb -version -o /tmp/ccvUWZkF.s
...
...
...
```

The `cc1` compiler compiles our `C` source code and produces the assembly file `/tmp/ccvUWZkF.s`. After this we can see that our assembly file is assembled into an object file with the `GNU as` assembler:

```
$ gcc -v -ggdb program.c -o sum
...
...
...
as -v --64 -o /tmp/cc79wZSU.o /tmp/ccvUWZkF.s
...
...
...
```

And in the end our object file is linked with `collect2`:

```
$ gcc -v -ggdb program.c -o sum
...
...
...
/usr/libexec/gcc/x86_64-redhat-linux/6.1.1/collect2 -plugin /usr/libexec/gcc/x86_64-redhat-linux/6.1.1/liblto_plugin.so -plugin-opt=/usr/libexec/gcc/x86_64-redhat-linux/6.1.1/lto-wrapper -plugin-opt=-fresolution=/tmp/ccLEGYra.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s --build-id --no-add-needed --eh-frame-hdr --hash-style=gnu -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o test /usr/lib/gcc/x86_64-redhat-linux/6.1.1/../../../../lib64/crt1.o /usr/lib/gcc/x86_64-redhat-linux/6.1.1/../../../../lib64/crti.o /usr/lib/gcc/x86_64-redhat-linux/6.1.1/crtbegin.o -L/usr/lib/gcc/x86_64-redhat-linux/6.1.1 -L/usr/lib/gcc/x86_64-redhat-linux/6.1.1/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L. -L/usr/lib/gcc/x86_64-redhat-linux/6.1.1/../../.. /tmp/cc79wZSU.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed /usr/lib/gcc/x86_64-redhat-linux/6.1.1/crtend.o /usr/lib/gcc/x86_64-redhat-linux/6.1.1/../../../../lib64/crtn.o
...
...
...
```

Yes, we can see a long set of command line options which are passed to the linker. Let's go the other way. We know that our program depends on `stdlib`:

```
~$ ldd program
	linux-vdso.so.1 (0x00007ffc9afd2000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f56b389b000)
	/lib64/ld-linux-x86-64.so.2 (0x0000556198231000)
```

as we use some stuff from there, like `printf`. But not only that. That's why we get an error when we pass the `-nostdlib` option to the compiler:

```
~$ gcc -nostdlib program.c -o program
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 000000000040017c
/tmp/cc02msGW.o: In function `main':
/home/alex/program.c:11: undefined reference to `printf'
collect2: error: ld returned 1 exit status
```

Besides the other errors, we also see that the `_start` symbol is undefined. So now we can be sure that the `_start` function comes from the standard library. But even if we link the program with the standard library, it will not compile successfully anyway:

```
~$ gcc -nostdlib -lc -ggdb program.c -o program
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000400350
```

OK, the compiler no longer complains about undefined references to standard library functions, as we linked our program with `/usr/lib64/libc.so.6`, but the `_start` symbol isn't resolved yet. Let's return to the verbose output of `gcc` and look at the parameters of `collect2`. The first important thing we can see is that our program is linked not only with the standard library, but also with some object files. The first object file is `/lib64/crt1.o`, and if we look inside this object file with the `objdump` util, we will see the `_start` symbol:

```
$ objdump -d /lib64/crt1.o

/lib64/crt1.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <_start>:
   0:	31 ed                	xor    %ebp,%ebp
   2:	49 89 d1             	mov    %rdx,%r9
   5:	5e                   	pop    %rsi
   6:	48 89 e2             	mov    %rsp,%rdx
   9:	48 83 e4 f0          	and    $0xfffffffffffffff0,%rsp
   d:	50                   	push   %rax
   e:	54                   	push   %rsp
   f:	49 c7 c0 00 00 00 00 	mov    $0x0,%r8
  16:	48 c7 c1 00 00 00 00 	mov    $0x0,%rcx
  1d:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
  24:	e8 00 00 00 00       	callq  29 <_start+0x29>
  29:	f4                   	hlt
```

As `crt1.o` is an object file that has not been linked yet, we see only stubs here instead of the real call targets; the linker will fill in the zeroed addresses. Let's look at the source code of the `_start` function. As this function is architecture-specific, the implementation for `x86_64` is located in the [sysdeps/x86_64/start.S](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/start.S;h=f1b961f5ba2d6a1ebffee0005f43123c4352fbf4;hb=HEAD) assembly file.

The `_start` function starts by clearing the `%ebp` register, as the [ABI](https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf) suggests:

```assembly
xorl %ebp, %ebp
```

After this we put the address of the termination function into the `%r9` register:

```assembly
mov %RDX_LP, %R9_LP
```

As described in the [ELF](http://flint.cs.yale.edu/cs422/doc/ELF_Format.pdf) specification:

> After the dynamic linker has built the process image and performed the relocations, each shared object
> gets the opportunity to execute some initialization code.
> ...
> Similarly, shared objects may have termination functions, which are executed with the atexit (BA_OS)
> mechanism after the base process begins its termination sequence.

So we need to put the address of the termination function into the `%r9` register, as it will later be passed to `__libc_start_main` as the sixth argument. Note that initially the address of the termination function is located in the `%rdx` register; registers other than `%rdx` and `%rsp` contain unspecified values. Actually the main point of the `_start` function is to call `__libc_start_main`, so the next actions are preparations for calling this function.

The signature of the `__libc_start_main` function is located in the [csu/libc-start.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=csu/libc-start.c;h=0fb98f1606bab475ab5ba2d0fe08c64f83cce9df;hb=HEAD) source code file. Let's look at it:

```C
STATIC int LIBC_START_MAIN (int (*main) (int, char **, char **),
			    int argc,
			    char **argv,
			    __typeof (main) init,
			    void (*fini) (void),
			    void (*rtld_fini) (void),
			    void *stack_end)
```

It takes the address of the `main` function of a program, plus `argc` and `argv`. The `init` and `fini` functions are the constructor and destructor of the program. `rtld_fini` is the termination function which will be called after the program exits, to terminate and free its dynamic section. The last parameter of `__libc_start_main` is a pointer to the stack of the program. Before we can call the `__libc_start_main` function, all of these parameters must be prepared and passed to it. Let's return to the [sysdeps/x86_64/start.S](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/start.S;h=f1b961f5ba2d6a1ebffee0005f43123c4352fbf4;hb=HEAD) assembly file and see what happens there before the `__libc_start_main` function is called.

Everything we need for the `__libc_start_main` function we can get from the stack. When `_start` is called, our stack looks like this:

```
+-----------------+
|      NULL       |
+-----------------+
|      envp       |
+-----------------+
|      NULL       |
+-----------------+
|      argv       |
+-----------------+
|      argc       | <- %rsp
+-----------------+
```

As the next step, having cleared the `%ebp` register and saved the address of the termination function in the `%r9` register, we pop the top element from the stack into the `%rsi` register, so that after this `%rsp` points to the `argv` array and `%rsi` contains the count of command line arguments passed to the program:

```
+-----------------+
|      NULL       |
+-----------------+
|      envp       |
+-----------------+
|      NULL       |
+-----------------+
|      argv       | <- %rsp
+-----------------+
```

After this we can move the address of the `argv` array into the `%rdx` register:

```assembly
popq %rsi
mov %RSP_LP, %RDX_LP
```

From this moment we have `argc` and `argv`. We still need to put the pointers to the constructor and destructor into the appropriate registers and pass the pointer to the stack. In the first of the following lines we align the stack to a `16`-byte boundary, as suggested in the [ABI](https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf), and push `%rax`, which contains garbage:

```assembly
and  $~15, %RSP_LP
pushq %rax

pushq %rsp
mov $__libc_csu_fini, %R8_LP
mov $__libc_csu_init, %RCX_LP
mov $main, %RDI_LP
```

After the stack alignment we push the address of the stack, move the addresses of the constructor and destructor into the `%r8` and `%rcx` registers, and move the address of the `main` symbol into `%rdi`. From this moment we can call the `__libc_start_main` function from [csu/libc-start.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=csu/libc-start.c;h=0fb98f1606bab475ab5ba2d0fe08c64f83cce9df;hb=HEAD).

Before we look at the `__libc_start_main` function, let's add `/lib64/crt1.o` and try to compile our program again:

```
$ gcc -nostdlib /lib64/crt1.o -lc -ggdb program.c -o program
/lib64/crt1.o: In function `_start':
(.text+0x12): undefined reference to `__libc_csu_fini'
/lib64/crt1.o: In function `_start':
(.text+0x19): undefined reference to `__libc_csu_init'
collect2: error: ld returned 1 exit status
```

Now we see another error: both the `__libc_csu_fini` and `__libc_csu_init` functions are not found. We know that the addresses of these two functions are passed to `__libc_start_main` as parameters, and that they are the constructor and destructor of our program. But what do `constructor` and `destructor` mean in terms of a `C` program? We already saw the quote from the [ELF](http://flint.cs.yale.edu/cs422/doc/ELF_Format.pdf) specification:
|
||||
|
||||
> After the dynamic linker has built the process image and performed the relocations, each shared object
> gets the opportunity to execute some initialization code.
> ...
> Similarly, shared objects may have termination functions, which are executed with the atexit (BA_OS)
> mechanism after the base process begins its termination sequence.

So, besides the usual sections like `.text`, `.data` and others, the linker creates two special sections:

* `.init`
* `.fini`

We can find them with the `readelf` util:

```
~$ readelf -e test | grep init
  [11] .init             PROGBITS         00000000004003c8  000003c8

~$ readelf -e test | grep fini
  [15] .fini             PROGBITS         0000000000400504  00000504
```

Both of these sections are placed at the start and end of the binary image and contain routines which are called the constructor and destructor respectively. The main point of these routines is to do some initialization/finalization before the actual code of a program is executed: for example, initialization of global variables like [errno](http://man7.org/linux/man-pages/man3/errno.3.html), or allocation and deallocation of memory for system routines.

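By the way, GCC lets us hook into the same machinery directly from `C` code with the `constructor` and `destructor` function attributes; this is just an illustrative sketch, not something the startup code itself uses:

```C
#include <stdio.h>

/* called before main() as part of program initialization */
__attribute__((constructor))
static void before_main(void)
{
    printf("constructor\n");
}

/* called after main() returns or exit() is called */
__attribute__((destructor))
static void after_main(void)
{
    printf("destructor\n");
}

int main(void)
{
    printf("main\n");
    return 0;
}
```
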
As you may understand from the names of these functions, they are called before the `main` function and after the `main` function finishes. The definitions of the `.init` and `.fini` sections are located in the `/lib64/crti.o` object file, and if we add this object file:

```
$ gcc -nostdlib /lib64/crt1.o /lib64/crti.o -lc -ggdb program.c -o program
```

we will not get any errors. But let's try to run our program and see what happens:

```
$ ./program
Segmentation fault (core dumped)
```

Yes, we got a segmentation fault. Let's look inside of the `/lib64/crti.o` object file with the `objdump` util:

```
~$ objdump -D /lib64/crti.o

/lib64/crti.o:     file format elf64-x86-64


Disassembly of section .init:

0000000000000000 <_init>:
   0:	48 83 ec 08          	sub    $0x8,%rsp
   4:	48 8b 05 00 00 00 00 	mov    0x0(%rip),%rax        # b <_init+0xb>
   b:	48 85 c0             	test   %rax,%rax
   e:	74 05                	je     15 <_init+0x15>
  10:	e8 00 00 00 00       	callq  15 <_init+0x15>

Disassembly of section .fini:

0000000000000000 <_fini>:
   0:	48 83 ec 08          	sub    $0x8,%rsp
```

As I wrote above, the `/lib64/crti.o` object file contains the definitions of the `.init` and `.fini` sections, but we can also see here a stub for a function. Let's look at the source code, which is placed in the [sysdeps/x86_64/crti.S](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/crti.S;h=e9d86ed08ab134a540e3dae5f97a9afb82cdb993;hb=HEAD) source code file:

```assembly
	.section .init,"ax",@progbits
	.p2align 2
	.globl _init
	.type _init, @function
_init:
	subq $8, %rsp
	movq PREINIT_FUNCTION@GOTPCREL(%rip), %rax
	testq %rax, %rax
	je .Lno_weak_fn
	call *%rax
.Lno_weak_fn:
	call PREINIT_FUNCTION
```

It contains the definition of the `.init` section. The assembly code aligns the stack, then moves the address of `PREINIT_FUNCTION` into `%rax`, and if it is zero we don't call it:

```
00000000004003c8 <_init>:
  4003c8:	48 83 ec 08          	sub    $0x8,%rsp
  4003cc:	48 8b 05 25 0c 20 00 	mov    0x200c25(%rip),%rax        # 600ff8 <_DYNAMIC+0x1d0>
  4003d3:	48 85 c0             	test   %rax,%rax
  4003d6:	74 05                	je     4003dd <_init+0x15>
  4003d8:	e8 43 00 00 00       	callq  400420 <__libc_start_main@plt+0x10>
  4003dd:	48 83 c4 08          	add    $0x8,%rsp
  4003e1:	c3                   	retq
```

where `PREINIT_FUNCTION` is `__gmon_start__`, which does setup for profiling. You may notice that we have no return instruction in [sysdeps/x86_64/crti.S](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/crti.S;h=e9d86ed08ab134a540e3dae5f97a9afb82cdb993;hb=HEAD). Actually, that's why we got the segmentation fault. The epilog of `_init` and `_fini` is placed in the [sysdeps/x86_64/crtn.S](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/crtn.S;h=e9d86ed08ab134a540e3dae5f97a9afb82cdb993;hb=HEAD) assembly file:

```assembly
	.section .init,"ax",@progbits
	addq $8, %rsp
	ret

	.section .fini,"ax",@progbits
	addq $8, %rsp
	ret
```

and if we add it to the compilation, our program will be successfully compiled and run!

```
~$ gcc -nostdlib /lib64/crt1.o /lib64/crti.o /lib64/crtn.o -lc -ggdb program.c -o program

~$ ./program
x + y + z = 6
```

Conclusion
--------------------------------------------------------------------------------

Now let's return to the `_start` function and try to go through the full chain of calls before the `main` function of our program is called.

The `_start` function is always placed at the beginning of the `.text` section in our programs by the linker, which uses the default `ld` script:

```
~$ ld --verbose | grep ENTRY
ENTRY(_start)
```

The `_start` function is defined in the [sysdeps/x86_64/start.S](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/start.S;h=f1b961f5ba2d6a1ebffee0005f43123c4352fbf4;hb=HEAD) assembly file and does preparation like getting `argc/argv` from the stack, preparing the stack and so on, before the `__libc_start_main` function is called. The `__libc_start_main` function from the [csu/libc-start.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=csu/libc-start.c;h=0fb98f1606bab475ab5ba2d0fe08c64f83cce9df;hb=HEAD) source code file registers the constructor and destructor of the application (which are called before `main` and after it), starts up threading, does some security related actions like setting the stack canary if needed, calls initialization related routines and in the end calls the `main` function of our application and exits with its result:

```C
result = main (argc, argv, __environ MAIN_AUXVEC_PARAM);
exit (result);
```

That's all.

Links
--------------------------------------------------------------------------------

* [system call](https://en.wikipedia.org/wiki/System_call)
* [gdb](https://www.gnu.org/software/gdb/)
* [execve](http://linux.die.net/man/2/execve)
* [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [segment registers](https://en.wikipedia.org/wiki/X86_memory_segmentation)
* [context switch](https://en.wikipedia.org/wiki/Context_switch)
* [System V ABI](https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf)

# Synchronization primitives in the Linux kernel

This chapter describes synchronization primitives in the Linux kernel.

* [Introduction to spinlocks](http://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) - the first part of this chapter describes the implementation of the spinlock mechanism in the Linux kernel.
* [Queued spinlocks](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-2.html) - the second part describes another type of spinlock - the queued spinlock.
* [Semaphores](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html) - this part describes the implementation of the `semaphore` synchronization primitive in the Linux kernel.
* [Mutual exclusion](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-4.html) - this part describes the `mutex` in the Linux kernel.
* [Reader/Writer semaphores](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-5.html) - this part describes a special type of semaphore - `reader/writer` semaphores.
* [Sequential locks](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-6.html) - this part describes sequential locks in the Linux kernel.

Synchronization primitives in the Linux kernel. Part 1.
================================================================================

Introduction
--------------------------------------------------------------------------------

This part opens a new chapter in the [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book. Timers and time management related stuff were described in the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html). Now it is time to go further. As you may understand from the part's title, this chapter will describe [synchronization](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) primitives in the Linux kernel.

As always, before we consider anything synchronization related, let's try to understand what a `synchronization primitive` is in general. A synchronization primitive is a software mechanism which provides the ability for two or more [parallel](https://en.wikipedia.org/wiki/Parallel_computing) processes or threads to not execute the same segment of code simultaneously. For example, let's look at the following piece of code:

```C
mutex_lock(&clocksource_mutex);
...
...
...
clocksource_enqueue(cs);
clocksource_enqueue_watchdog(cs);
clocksource_select();
...
...
...
mutex_unlock(&clocksource_mutex);
```

from the [kernel/time/clocksource.c](https://github.com/torvalds/linux/blob/master/kernel/time/clocksource.c) source code file. This code is from the `__clocksource_register_scale` function, which adds the given [clocksource](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) to the clock sources list. This function performs different operations on the list of registered clock sources. For example, the `clocksource_enqueue` function adds the given clock source to the list of registered clocksources - `clocksource_list`. Note that these lines of code are wrapped in two functions: `mutex_lock` and `mutex_unlock`, which take one parameter - the `clocksource_mutex` in our case.

These functions represent locking and unlocking based on the [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) synchronization primitive. As `mutex_lock` is executed, it allows us to prevent the situation when two or more threads execute this code while `mutex_unlock` has not yet been executed by the process which owns the mutex. In other words, we prevent parallel operations on the `clocksource_list`. Why do we need a `mutex` here? What if two parallel processes try to register a clock source? As we already know, the `clocksource_enqueue` function adds the given clock source to the list right after the clock source with the biggest rating (a registered clock source which has the highest frequency in the system):

```C
static void clocksource_enqueue(struct clocksource *cs)
{
	struct list_head *entry = &clocksource_list;
	struct clocksource *tmp;

	list_for_each_entry(tmp, &clocksource_list, list)
		if (tmp->rating >= cs->rating)
			entry = &tmp->list;
	list_add(&cs->list, entry);
}
```

If two parallel processes try to do this simultaneously, both processes may find the same `entry` and a [race condition](https://en.wikipedia.org/wiki/Race_condition) may occur; in other words, the second process which executes `list_add` will overwrite the clock source added by the first process.

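To make the pattern above more tangible, here is a hedged sketch of the same protect-a-list idiom in kernel code; the names `my_mutex`, `my_list` and `my_add` are made up for the illustration:

```C
#include <linux/list.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(my_mutex);
static LIST_HEAD(my_list);

static void my_add(struct list_head *entry)
{
	/* only one thread at a time may walk and modify my_list */
	mutex_lock(&my_mutex);
	list_add(entry, &my_list);
	mutex_unlock(&my_mutex);
}
```
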
Besides this simple example, synchronization primitives are ubiquitous in the Linux kernel. If we go through the previous [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) or other chapters again, or if we look at the Linux kernel source code in general, we will meet many places like this. We will not consider how a `mutex` is implemented in the Linux kernel here. Actually, the Linux kernel provides a set of different synchronization primitives, like:

* `mutex`;
* `semaphores`;
* `seqlocks`;
* `atomic operations`;
* etc.

We will start this chapter with the `spinlock`.

Spinlocks in the Linux kernel.
--------------------------------------------------------------------------------

The `spinlock` is a low-level synchronization mechanism which, in simple words, represents a variable which can be in one of two states:

* `acquired`;
* `released`.

Each process which wants to acquire a `spinlock` must write the value which represents the `spinlock acquired` state to this variable, and write the `spinlock released` state back when it releases the lock. If a process tries to execute code which is protected by a `spinlock`, it will spin until the process which holds this lock releases it. All related operations must be [atomic](https://en.wikipedia.org/wiki/Linearizability) to prevent [race conditions](https://en.wikipedia.org/wiki/Race_condition). The `spinlock` is represented by the `spinlock_t` type in the Linux kernel. If we look at the Linux kernel code, we will see that this type is [widely](http://lxr.free-electrons.com/ident?i=spinlock_t) used. The `spinlock_t` is defined as:

```C
typedef struct spinlock {
	union {
		struct raw_spinlock rlock;

#ifdef CONFIG_DEBUG_LOCK_ALLOC
# define LOCK_PADSIZE (offsetof(struct raw_spinlock, dep_map))
		struct {
			u8 __padding[LOCK_PADSIZE];
			struct lockdep_map dep_map;
		};
#endif
	};
} spinlock_t;
```

and is located in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) header file. We may see that its implementation depends on the state of the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option. We will skip this for now, because all debugging related stuff will be at the end of this part. So, if the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option is disabled, the `spinlock_t` contains a [union](https://en.wikipedia.org/wiki/Union_type#C.2FC.2B.2B) with one field - a `raw_spinlock`:

```C
typedef struct spinlock {
	union {
		struct raw_spinlock rlock;
	};
} spinlock_t;
```

The `raw_spinlock` structure, defined in the [same](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) header file, represents the implementation of a `normal` spinlock. Let's look at how the `raw_spinlock` structure is defined:

```C
typedef struct raw_spinlock {
	arch_spinlock_t raw_lock;
#ifdef CONFIG_GENERIC_LOCKBREAK
	unsigned int break_lock;
#endif
} raw_spinlock_t;
```

where `arch_spinlock_t` represents the architecture-specific `spinlock` implementation, and the `break_lock` field holds the value `1` when one processor starts to wait while the lock is held by another processor on [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) systems. This helps prevent excessively long locking. As we consider the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture in this book, the `arch_spinlock_t` is defined in the [arch/x86/include/asm/spinlock_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/spinlock_types.h) header file and looks like:

```C
#ifdef CONFIG_QUEUED_SPINLOCKS
#include <asm-generic/qspinlock_types.h>
#else
typedef struct arch_spinlock {
	union {
		__ticketpair_t head_tail;
		struct __raw_tickets {
			__ticket_t head, tail;
		} tickets;
	};
} arch_spinlock_t;
#endif
```

As we may see, the definition of the `arch_spinlock` structure depends on the value of the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option. With this configuration option the Linux kernel supports `spinlocks` with a queue: a special type of `spinlock` which, instead of the `acquired` and `released` [atomic](https://en.wikipedia.org/wiki/Linearizability) values, uses `atomic` operations on a `queue`. If the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option is enabled, the `arch_spinlock_t` will be represented by the following structure:

```C
typedef struct qspinlock {
	atomic_t	val;
} arch_spinlock_t;
```

from the [include/asm-generic/qspinlock_types.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock_types.h) header file.

We will not dwell on these structures for now. Before we consider both `arch_spinlock` and `qspinlock`, let's look at the operations on a spinlock. The Linux kernel provides the following main operations on a `spinlock` (a short usage sketch follows the list):

* `spin_lock_init` - produces initialization of the given `spinlock`;
* `spin_lock` - acquires the given `spinlock`;
* `spin_lock_bh` - disables software [interrupts](https://en.wikipedia.org/wiki/Interrupt) and acquires the given `spinlock`;
* `spin_lock_irqsave` and `spin_lock_irq` - disable interrupts on the local processor and preserve/do not preserve the previous interrupt state in `flags`;
* `spin_unlock` - releases the given `spinlock`;
* `spin_unlock_bh` - releases the given `spinlock` and enables software interrupts;
* `spin_is_locked` - returns the state of the given `spinlock`;
* etc.

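A hedged usage sketch of this API (all names here are made up; taking the lock with interrupts disabled and restoring them afterwards is the common pattern for data shared with interrupt handlers):

```C
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_lock);
static unsigned long my_counter;

static void my_increment(void)
{
	unsigned long flags;

	/* take the lock and disable interrupts on the local processor */
	spin_lock_irqsave(&my_lock, flags);
	my_counter++;
	/* release the lock and restore the previous interrupt state */
	spin_unlock_irqrestore(&my_lock, flags);
}
```
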
Let's look at the implementation of the `spin_lock_init` macro. As I already wrote, this and the other macros are defined in the [include/linux/spinlock.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock.h) header file, and the `spin_lock_init` macro looks like:

```C
#define spin_lock_init(_lock)			\
do {						\
	spinlock_check(_lock);			\
	raw_spin_lock_init(&(_lock)->rlock);	\
} while (0)
```

As we may see, the `spin_lock_init` macro takes a `spinlock` and executes two operations: check the given `spinlock` and execute `raw_spin_lock_init`. The implementation of `spinlock_check` is pretty easy: this function just returns the `raw_spinlock_t` of the given `spinlock`, to be sure that we got exactly a `normal` raw spinlock:

```C
static __always_inline raw_spinlock_t *spinlock_check(spinlock_t *lock)
{
	return &lock->rlock;
}
```

The `raw_spin_lock_init` macro:

```C
# define raw_spin_lock_init(lock)			\
do {							\
	*(lock) = __RAW_SPIN_LOCK_UNLOCKED(lock);	\
} while (0)
```

assigns the value of `__RAW_SPIN_LOCK_UNLOCKED` for the given `spinlock` to the given `raw_spinlock_t`. As we may understand from the name of the `__RAW_SPIN_LOCK_UNLOCKED` macro, it initializes the given `spinlock` and sets it to the `released` state. This macro is defined in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) header file and expands to the following macros:

```C
#define __RAW_SPIN_LOCK_UNLOCKED(lockname)	\
	(raw_spinlock_t) __RAW_SPIN_LOCK_INITIALIZER(lockname)

#define __RAW_SPIN_LOCK_INITIALIZER(lockname)		\
	{						\
		.raw_lock = __ARCH_SPIN_LOCK_UNLOCKED,	\
		SPIN_DEBUG_INIT(lockname)		\
		SPIN_DEP_MAP_INIT(lockname)		\
	}
```

As I already wrote above, we will not consider stuff which is related to the debugging of synchronization primitives, so we skip the `SPIN_DEBUG_INIT` and `SPIN_DEP_MAP_INIT` macros. So the `__RAW_SPIN_LOCK_UNLOCKED` macro will be expanded to:

```C
*(&(_lock)->rlock) = __ARCH_SPIN_LOCK_UNLOCKED;
```

where the `__ARCH_SPIN_LOCK_UNLOCKED` is:

```C
#define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
```

and:

```C
#define __ARCH_SPIN_LOCK_UNLOCKED	{ ATOMIC_INIT(0) }
```

for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, if the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option is enabled. So, after the expansion of the `spin_lock_init` macro, the given `spinlock` will be initialized and its state will be `unlocked`.

From this moment we know how to initialize a `spinlock`. Now let's consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) which the Linux kernel provides for the manipulation of `spinlocks`. The first is:

```C
static __always_inline void spin_lock(spinlock_t *lock)
{
	raw_spin_lock(&lock->rlock);
}
```

a function which allows us to `acquire` a spinlock. The `raw_spin_lock` macro is defined in the same header file and expands to the call of the `_raw_spin_lock` function:

```C
#define raw_spin_lock(lock)	_raw_spin_lock(lock)
```

As we may see in the [include/linux/spinlock.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock.h) header file, the definition of the `_raw_spin_lock` macro depends on the `CONFIG_SMP` kernel configuration parameter:

```C
#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
# include <linux/spinlock_api_smp.h>
#else
# include <linux/spinlock_api_up.h>
#endif
```

So, if [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing) is enabled in the Linux kernel, the `_raw_spin_lock` macro is defined in the [arch/x86/include/asm/spinlock.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/spinlock.h) header file and looks like:

```C
#define _raw_spin_lock(lock) __raw_spin_lock(lock)
```

The `__raw_spin_lock` function looks like:

```C
static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}
```

As you may see, first of all we disable [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) by calling the `preempt_disable` macro from [include/linux/preempt.h](https://github.com/torvalds/linux/blob/master/include/linux/preempt.h) (more about this you may read in the ninth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html) of the Linux kernel initialization process chapter). When we unlock the given `spinlock`, preemption will be enabled again:

```C
static inline void __raw_spin_unlock(raw_spinlock_t *lock)
{
	...
	...
	...
	preempt_enable();
}
```

We need to do this because while a process is spinning on a lock, other processes must be prevented from preempting the process which acquired the lock. The `spin_acquire` macro, through a chain of other macros, expands to the call of:

```C
#define spin_acquire(l, s, t, i)		lock_acquire_exclusive(l, s, t, NULL, i)
#define lock_acquire_exclusive(l, s, t, n, i)	lock_acquire(l, s, t, 0, 1, n, i)
```

the `lock_acquire` function:

```C
void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
		  int trylock, int read, int check,
		  struct lockdep_map *nest_lock, unsigned long ip)
{
	unsigned long flags;

	if (unlikely(current->lockdep_recursion))
		return;

	raw_local_irq_save(flags);
	check_flags(flags);

	current->lockdep_recursion = 1;
	trace_lock_acquire(lock, subclass, trylock, read, check, nest_lock, ip);
	__lock_acquire(lock, subclass, trylock, read, check,
		       irqs_disabled_flags(flags), nest_lock, ip, 0, 0);
	current->lockdep_recursion = 0;
	raw_local_irq_restore(flags);
}
```

As I wrote above, we will not consider stuff here which is related to debugging or tracing. The main point of the `lock_acquire` function is to disable hardware interrupts by calling the `raw_local_irq_save` macro, because the given spinlock might be acquired with hardware interrupts enabled. In this way the process will not be preempted. Note that at the end of the `lock_acquire` function we enable hardware interrupts again with the help of the `raw_local_irq_restore` macro. As you may already guess, the main work is done in the `__lock_acquire` function, which is defined in the [kernel/locking/lockdep.c](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep.c) source code file.

The `__lock_acquire` function looks big. We will try to understand what this function does, but not in this part. Actually this function is mostly related to the Linux kernel [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) and it is not the topic of this part. If we return to the definition of the `__raw_spin_lock` function, we will see that it contains the following call at the end:

```C
LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
```

The `LOCK_CONTENDED` macro is defined in the [include/linux/lockdep.h](https://github.com/torvalds/linux/blob/master/include/linux/lockdep.h) header file and just calls the given function with the given `spinlock`:

```C
#define LOCK_CONTENDED(_lock, try, lock) \
	lock(_lock)
```

In our case, `lock` is the `do_raw_spin_lock` function from the [include/linux/spinlock.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock.h) header file and `_lock` is the given `raw_spinlock_t`:

```C
static inline void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock)
{
	__acquire(lock);
	arch_spin_lock(&lock->raw_lock);
}
```

The `__acquire` here is just a [sparse](https://en.wikipedia.org/wiki/Sparse) related macro and we are not interested in it at the moment. The location of the definition of the `arch_spin_lock` function depends on two things: the architecture of the system and whether we use `queued spinlocks` or not. In our case we consider only the `x86_64` architecture, so the definition of `arch_spin_lock` is represented as a macro from the [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h) header file:

```C
#define arch_spin_lock(l)		queued_spin_lock(l)
```

if we are using `queued spinlocks`. Otherwise, the `arch_spin_lock` function is defined in the [arch/x86/include/asm/spinlock.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/spinlock.h) header file. For now we will consider only the `normal spinlock`; information related to `queued spinlocks` we will see later. Let's look again at the definition of the `arch_spinlock` structure, to understand the implementation of the `arch_spin_lock` function:

```C
typedef struct arch_spinlock {
	union {
		__ticketpair_t head_tail;
		struct __raw_tickets {
			__ticket_t head, tail;
		} tickets;
	};
} arch_spinlock_t;
```

This variant of `spinlock` is called a `ticket spinlock`. As we may see, it consists of two parts. Every time a process wants to hold the `spinlock`, it increments `tail` by one. If `tail` is not equal to `head`, the process spins until the values of these variables become equal. Let's look at the implementation of the `arch_spin_lock` function:

```C
static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
{
	register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };

	inc = xadd(&lock->tickets, inc);

	if (likely(inc.head == inc.tail))
		goto out;

	for (;;) {
		unsigned count = SPIN_THRESHOLD;

		do {
			inc.head = READ_ONCE(lock->tickets.head);
			if (__tickets_equal(inc.head, inc.tail))
				goto clear_slowpath;
			cpu_relax();
		} while (--count);
		__ticket_lock_spinning(lock, inc.tail);
	}
clear_slowpath:
	__ticket_check_and_clear_slowpath(lock, inc.head);
out:
	barrier();
}
```

At the beginning of the `arch_spin_lock` function we can see the initialization of the `__raw_tickets` structure with `tail` set to `1`:

```C
#define __TICKET_LOCK_INC	1
```

In the next line we execute the [xadd](http://x86.renejeschke.de/html/file_module_x86_id_327.html) operation on `inc` and `lock->tickets`. After this operation `inc` stores the value of the `tickets` of the given `lock`, and `tickets.tail` is increased by `inc`, i.e. by `1`. The `tail` value has been increased by `1`, which means that one process has started trying to hold the lock. In the next step we check whether `head` and `tail` have the same value. If they are equal, it means that nobody holds the lock and we go to the `out` label. At the end of the `arch_spin_lock` function we may see the `barrier` macro, which represents a `barrier instruction` guaranteeing that the compiler will not change the order of operations that access memory (more about memory barriers you can read in the kernel [documentation](https://www.kernel.org/doc/Documentation/memory-barriers.txt)).

If one process holds the lock and a second process starts to execute the `arch_spin_lock` function, `head` will not be equal to `tail`, because `tail` will be greater than `head` by `1`. In this case, the process spins in the loop. At each loop iteration `head` and `tail` are compared. If these values are not equal, `cpu_relax`, which is just a [NOP](https://en.wikipedia.org/wiki/NOP) instruction, is called:

```C
#define cpu_relax()	asm volatile("rep; nop")
```

and the next iteration of the loop starts. If these values become equal, it means that the process which held the lock has released it and the next process may acquire it.

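To see the ticket idea without the kernel plumbing, here is a hedged user-space sketch of a ticket lock built on C11 atomics (names are invented; the real kernel code above also handles paravirtualized slowpaths, which this sketch ignores):

```C
#include <stdatomic.h>

struct ticket_lock {
	atomic_uint head;	/* ticket currently being served */
	atomic_uint tail;	/* next free ticket               */
};

static void ticket_lock(struct ticket_lock *lock)
{
	/* atomically take a ticket; this plays the role of xadd */
	unsigned int ticket = atomic_fetch_add(&lock->tail, 1);

	/* spin until our ticket is served */
	while (atomic_load(&lock->head) != ticket)
		;
}

static void ticket_unlock(struct ticket_lock *lock)
{
	/* pass the lock to the holder of the next ticket */
	atomic_fetch_add(&lock->head, 1);
}
```
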
The `spin_unlock` operation goes through all the same macros/functions as `spin_lock`, of course with an `unlock` prefix. In the end the `arch_spin_unlock` function is called. If we look at the implementation of the `arch_spin_unlock` function, we will see that it increases the `head` of the `lock tickets` list:

```C
__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
```

In the combination of `spin_lock` and `spin_unlock` we get a kind of queue, where `head` contains an index number which maps to the currently executing process which holds the lock, and `tail` contains an index number which maps to the last process which tried to hold the lock:

```
     +-------+           +-------+
     |       |           |       |
head |   7   | - - - - - |   7   | tail
     |       |           |       |
     +-------+           +-------+
                             |
                         +-------+
                         |       |
                         |   8   |
                         |       |
                         +-------+
                             |
                         +-------+
                         |       |
                         |   9   |
                         |       |
                         +-------+
```

That's all for now. We didn't cover the whole `spinlock` API in this part, but I think that the main idea behind this concept must be clear now.

Conclusion
--------------------------------------------------------------------------------

This concludes the first part covering synchronization primitives in the Linux kernel. In this part we met the first synchronization primitive provided by the Linux kernel - the `spinlock`. In the next part we will continue to dive into this interesting theme and will see other `synchronization` related stuff.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [Concurrent computing](https://en.wikipedia.org/wiki/Concurrent_computing)
* [Synchronization](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29)
* [Clocksource framework](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html)
* [Mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
* [Race condition](https://en.wikipedia.org/wiki/Race_condition)
* [Atomic operations](https://en.wikipedia.org/wiki/Linearizability)
* [SMP](https://en.wikipedia.org/wiki/Symmetric_multiprocessing)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [Interrupts](https://en.wikipedia.org/wiki/Interrupt)
* [Preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29)
* [Linux kernel lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
* [Sparse](https://en.wikipedia.org/wiki/Sparse)
* [xadd instruction](http://x86.renejeschke.de/html/file_module_x86_id_327.html)
* [NOP](https://en.wikipedia.org/wiki/NOP)
* [Memory barriers](https://www.kernel.org/doc/Documentation/memory-barriers.txt)
* [Previous chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html)

Synchronization primitives in the Linux kernel. Part 2.
================================================================================

Queued Spinlocks
--------------------------------------------------------------------------------

This is the second part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html) which describes synchronization primitives in the Linux kernel. In the first [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) of this chapter we met the first one - the [spinlock](https://en.wikipedia.org/wiki/Spinlock). We will continue to learn about this synchronization primitive here. If you have read the previous part, you may remember that besides normal spinlocks, the Linux kernel provides a special type of `spinlock` - `queued spinlocks`. In this part we will try to understand what this concept represents.

We saw the [API](https://en.wikipedia.org/wiki/Application_programming_interface) of the `spinlock` in the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html):

* `spin_lock_init` - produces initialization of the given `spinlock`;
* `spin_lock` - acquires the given `spinlock`;
* `spin_lock_bh` - disables software [interrupts](https://en.wikipedia.org/wiki/Interrupt) and acquires the given `spinlock`;
* `spin_lock_irqsave` and `spin_lock_irq` - disable interrupts on the local processor and preserve/do not preserve the previous interrupt state in `flags`;
* `spin_unlock` - releases the given `spinlock`;
* `spin_unlock_bh` - releases the given `spinlock` and enables software interrupts;
* `spin_is_locked` - returns the state of the given `spinlock`;
* etc.

And we know that all of these macros, defined in the [include/linux/spinlock.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock.h) header file, expand to calls of functions with the `arch_spin_.*` prefix from [arch/x86/include/asm/spinlock.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/spinlock.h) for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture. If we look at this header file with attention, we will see that these functions (`arch_spin_is_locked`, `arch_spin_lock`, `arch_spin_unlock`, etc.) are defined only if the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option is disabled:

```C
#ifdef CONFIG_QUEUED_SPINLOCKS
#include <asm/qspinlock.h>
#else
static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
{
	...
	...
	...
}
...
...
...
#endif
```

This means that the [arch/x86/include/asm/qspinlock.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/qspinlock.h) header file provides its own implementation of these functions. Actually they are macros, and they are located in another header file - [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h#L126). If we look into this header file, we will find the definitions of these macros:

```C
#define arch_spin_is_locked(l)		queued_spin_is_locked(l)
#define arch_spin_is_contended(l)	queued_spin_is_contended(l)
#define arch_spin_value_unlocked(l)	queued_spin_value_unlocked(l)
#define arch_spin_lock(l)		queued_spin_lock(l)
#define arch_spin_trylock(l)		queued_spin_trylock(l)
#define arch_spin_unlock(l)		queued_spin_unlock(l)
#define arch_spin_lock_flags(l, f)	queued_spin_lock(l)
#define arch_spin_unlock_wait(l)	queued_spin_unlock_wait(l)
```

Before we consider how queued spinlocks and their [API](https://en.wikipedia.org/wiki/Application_programming_interface) are implemented, let's take a look at the theory first.

Introduction to queued spinlocks
-------------------------------------------------------------------------------

Queued spinlocks are a [locking mechanism](https://en.wikipedia.org/wiki/Lock_%28computer_science%29) in the Linux kernel which is a replacement for standard `spinlocks`. At least this is true for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture. If we look at the following kernel configuration file - [kernel/Kconfig.locks](https://github.com/torvalds/linux/blob/master/kernel/Kconfig.locks), we will see the following configuration entries:

```
config ARCH_USE_QUEUED_SPINLOCKS
	bool

config QUEUED_SPINLOCKS
	def_bool y if ARCH_USE_QUEUED_SPINLOCKS
	depends on SMP
```

This means that the `CONFIG_QUEUED_SPINLOCKS` kernel configuration option will be enabled by default if the `ARCH_USE_QUEUED_SPINLOCKS` option is enabled. We may see that `ARCH_USE_QUEUED_SPINLOCKS` is enabled by default in the `x86_64` specific kernel configuration file - [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig):

```
config X86
	...
	...
	...
	select ARCH_USE_QUEUED_SPINLOCKS
	...
	...
	...
```

Before we start to consider what the queued spinlock concept is, let's look at other types of `spinlocks`. First, let's consider how a `normal` spinlock is implemented. Usually, the implementation of a `normal` spinlock is based on the [test and set](https://en.wikipedia.org/wiki/Test-and-set) instruction. The principle of this instruction is pretty simple: it writes a value to a memory location and returns the old value from it. Both of these operations happen atomically, i.e. the instruction is non-interruptible. So if the first thread has started to execute this instruction, a second thread will wait until the first one finishes. A basic lock can be built on top of this mechanism. Schematically it may look like this:

```C
int lock(int *lock)
{
	while (test_and_set(lock) == 1)
		;
	return 0;
}

int unlock(int *lock)
{
	*lock = 0;

	return *lock;
}
```

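This scheme can be written as real, runnable code with standard C11 atomics, where `atomic_flag_test_and_set` is exactly a test-and-set operation; a minimal sketch (names invented for the illustration):

```C
#include <stdatomic.h>

static atomic_flag lock_var = ATOMIC_FLAG_INIT;

static void lock(void)
{
	/* spin while the previous value was 1 (lock was taken) */
	while (atomic_flag_test_and_set(&lock_var))
		;
}

static void unlock(void)
{
	atomic_flag_clear(&lock_var);
}
```
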
The first thread executes `test_and_set`, which sets the `lock` to `1`. When the second thread calls the `lock` function, it will spin in the `while` loop until the first thread calls the `unlock` function and the `lock` becomes `0`. This implementation is not very good for performance, because it has at least two problems. The first problem is that this implementation may be unfair: a thread from one processor may have a long waiting time, even if it called `lock` before other threads which are also waiting for the free lock. The second problem is that all threads which want to acquire the lock must execute many `atomic` operations like `test_and_set` on a variable in shared memory. This leads to cache invalidation, as each processor's cache stores its own copy of the lock variable, which must be invalidated every time another processor writes to it.

In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) we saw the second type of spinlock implementation - the `ticket spinlock`. This approach solves the first problem and may guarantee the order in which threads acquire a lock, but it still has the second problem.

The topic of this part is `queued spinlocks`. This approach may help to solve both of these problems. `Queued spinlocks` allow each processor to spin on its own memory location. The basic principle of a queue-based spinlock can best be understood by studying a classic queue-based spinlock implementation called the [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) lock. Before we look at the implementation of `queued spinlocks` in the Linux kernel, we will try to understand what an `MCS` lock is.

The basic idea of the `MCS` lock is, as I already wrote in the previous paragraph, that a thread spins on a local variable and each processor in the system has its own copy of this variable. In other words, this concept is built on top of the [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variables concept in the Linux kernel.

When the first thread wants to acquire the lock, it registers itself in the `queue`, or in other words it is added to the special `queue` and acquires the lock, because the lock is free for now. When the second thread wants to acquire the same lock before the first thread has released it, this thread adds its own copy of the lock variable to this `queue`. In this case the `next` field of the first thread will point to the second thread. From this moment, the second thread waits until the first thread releases its lock and notifies the `next` thread about this event. The first thread is then deleted from the `queue` and the second thread becomes the owner of the lock.

Schematically we can represent it like this:

Empty queue:

```
+---------+
|         |
|  Queue  |
|         |
+---------+
```

First thread tries to acquire a lock:

```
+---------+     +----------------------------+
|         |     |                            |
|  Queue  |---->| First thread acquired lock |
|         |     |                            |
+---------+     +----------------------------+
```

Second thread tries to acquire a lock:

```
+---------+     +---------------------------------------+     +-------------------------+
|         |     |                                       |     |                         |
|  Queue  |---->| Second thread waits for first thread  |<----| First thread holds lock |
|         |     |                                       |     |                         |
+---------+     +---------------------------------------+     +-------------------------+
```

Or in pseudocode:

```C
void lock(...)
{
	lock.next = NULL;
	ancestor = put_lock_to_queue_and_return_ancestor(queue, lock);

	// if we have an ancestor, the lock is already acquired
	// and we need to wait until it is released
	if (ancestor)
	{
		lock.locked = 1;
		ancestor.next = lock;

		while (lock.is_locked == true)
			;
	}

	// otherwise we are the owner of the lock and may exit
}

void unlock(...)
{
	// do we need to notify somebody, or are we alone in the
	// queue?
	if (lock.next != NULL) {
		// the while loop from the lock() function will
		// finish
		lock.next.is_locked = false;
		// delete ourselves from the queue and exit
		...
		...
		...
		return;
	}

	// So, we have no next threads in the queue to notify about
	// the lock releasing event. Let's just put `0` in the lock,
	// delete ourselves from the queue and exit.
}
```

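For readers who want something compilable, here is a hedged user-space sketch of an MCS-style lock using C11 atomics; all names are invented for the sketch, and memory ordering is left at the sequentially consistent default for simplicity:

```C
#include <stdatomic.h>
#include <stddef.h>

struct mcs_node {
	struct mcs_node *volatile next;
	volatile int locked;
};

/* tail of the queue; NULL means the lock is free */
static _Atomic(struct mcs_node *) mcs_tail;

static void mcs_lock(struct mcs_node *me)
{
	struct mcs_node *prev;

	me->next = NULL;
	me->locked = 1;

	/* atomically put ourselves at the tail, getting the old tail */
	prev = atomic_exchange(&mcs_tail, me);
	if (prev) {
		prev->next = me;	/* link in behind the old tail */
		while (me->locked)	/* spin on our own node        */
			;
	}
}

static void mcs_unlock(struct mcs_node *me)
{
	struct mcs_node *expected = me;

	if (!me->next) {
		/* nobody queued behind us: try to free the lock */
		if (atomic_compare_exchange_strong(&mcs_tail, &expected, NULL))
			return;
		/* a locker is between its exchange and its link-in */
		while (!me->next)
			;
	}
	me->next->locked = 0;		/* hand the lock to the next node */
}
```
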
The idea is simple, but the implementation of `queued spinlocks` is much more complex than this pseudocode. As I already wrote above, the `queued spinlock` mechanism is planned as a replacement for `ticket spinlocks` in the Linux kernel. But as you may remember, the usual `spinlock` fits into a `32-bit` [word](https://en.wikipedia.org/wiki/Word_%28computer_architecture%29), while an `MCS` based lock does not fit into this size. As you may know, the `spinlock_t` type is [widely](http://lxr.free-electrons.com/ident?i=spinlock_t) used in the Linux kernel, so we would have to rewrite a significant part of the Linux kernel, which is unacceptable. Besides this, some kernel structures which contain a spinlock for protection cannot grow. But anyway, the implementation of `queued spinlocks` in the Linux kernel is based on this concept, with some modifications which allow it to fit into `32` bits.

That's all for the theory of `queued spinlocks`. Now let's consider how this mechanism is implemented in the Linux kernel. The implementation of `queued spinlocks` looks more complex and tangled than the implementation of `ticket spinlocks`, but careful study will lead to success.

API of queued spinlocks
-------------------------------------------------------------------------------

Now that we know a little about `queued spinlocks` from the theoretical side, it is time to see the implementation of this mechanism in the Linux kernel. As we saw above, the [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h#L126) header file provides a set of macros which represent the API for spinlock acquiring, releasing, etc.:

```C
#define arch_spin_is_locked(l)		queued_spin_is_locked(l)
#define arch_spin_is_contended(l)	queued_spin_is_contended(l)
#define arch_spin_value_unlocked(l)	queued_spin_value_unlocked(l)
#define arch_spin_lock(l)		queued_spin_lock(l)
#define arch_spin_trylock(l)		queued_spin_trylock(l)
#define arch_spin_unlock(l)		queued_spin_unlock(l)
#define arch_spin_lock_flags(l, f)	queued_spin_lock(l)
#define arch_spin_unlock_wait(l)	queued_spin_unlock_wait(l)
```

All of these macros expand to calls of functions from the same header file. Additionally, we saw the `qspinlock` structure from the [include/asm-generic/qspinlock_types.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock_types.h) header file, which represents a queued spinlock in the Linux kernel:

```C
typedef struct qspinlock {
	atomic_t	val;
} arch_spinlock_t;
```

As we may see, the `qspinlock` structure contains only one field - `val`. This field represents the state of the given `spinlock`. This `4` byte field consists of the following parts:

* `0-7` - locked byte;
* `8` - pending bit;
* `16-17` - two bit index which represents the entry of the `per-cpu` array of `MCS` locks (we will see it soon);
* `18-31` - contains the number of the processor which indicates the tail of the queue.

Bits `9-15` are not used.

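A hedged illustration of this layout with user-space masks (the actual masks and offsets live in the kernel's qspinlock headers; these constants only mirror the bit ranges listed above):

```C
#include <stdint.h>
#include <stdio.h>

#define LOCKED_MASK   0x000000ffu	/* bits 0-7:   locked byte   */
#define PENDING_MASK  0x00000100u	/* bit 8:      pending bit   */
#define IDX_MASK      0x00030000u	/* bits 16-17: per-cpu index */
#define TAIL_CPU_MASK 0xfffc0000u	/* bits 18-31: tail cpu      */

int main(void)
{
	uint32_t val = 0x00040001u;	/* locked byte set, tail cpu field = 1 */

	printf("locked   = %u\n", val & LOCKED_MASK);
	printf("pending  = %u\n", (val & PENDING_MASK) >> 8);
	printf("idx      = %u\n", (val & IDX_MASK) >> 16);
	printf("tail cpu = %u\n", (val & TAIL_CPU_MASK) >> 18);

	return 0;
}
```
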
As we already know, each processor in the system has its own copy of the lock, represented by the following structure:

```C
struct mcs_spinlock {
	struct mcs_spinlock *next;
	int locked;
	int count;
};
```

from the [kernel/locking/mcs_spinlock.h](https://github.com/torvalds/linux/blob/master/kernel/locking/mcs_spinlock.h) header file. The first field represents a pointer to the next thread in the `queue`. The second field represents the state of the current thread in the `queue`, where `1` means the `lock` is already acquired and `0` means the opposite. The last field of the `mcs_spinlock` structure represents nested locks. To understand what a nested lock is, imagine the situation when a thread has acquired a lock but was interrupted by a hardware [interrupt](https://en.wikipedia.org/wiki/Interrupt), and the [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler) tries to take the lock too. For this case, each processor has not just a copy of the `mcs_spinlock` structure, but an array of these structures:

```C
static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);
```

This array allows four attempts at lock acquisition, one for each of the following contexts:

* normal task context;
* hardware interrupt context;
* software interrupt context;
* non-maskable interrupt context.

Now let's return to the `qspinlock` structure and the `API` of the `queued spinlocks`. Before we move on to the `API` of `queued spinlocks`, notice that the `val` field of the `qspinlock` structure has the type `atomic_t`, which represents an atomic variable, i.e. a one-operation-at-a-time variable. So, all operations on this field will be [atomic](https://en.wikipedia.org/wiki/Linearizability). For example, let's look at the API that reads the value of `val`:

```C
static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
{
	return atomic_read(&lock->val);
}
```

Ok, now we know the data structures which represent a queued spinlock in the Linux kernel, and now it is time to look at the implementation of the `main` function from the `queued spinlocks` [API](https://en.wikipedia.org/wiki/Application_programming_interface):

```C
#define arch_spin_lock(l)		queued_spin_lock(l)
```

Yes, this function is `queued_spin_lock`. As we may understand from the function's name, it allows a thread to acquire a lock. This function is defined in the [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h) header file and its implementation looks like:

```C
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
	u32 val;

	val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
	if (likely(val == 0))
		return;
	queued_spin_lock_slowpath(lock, val);
}
```

Looks pretty easy, except for the `queued_spin_lock_slowpath` function. We may see that `queued_spin_lock` takes only one parameter, which in our case represents the `queued spinlock` that will be locked. Let's consider the situation where the `queue` of locks is empty for now and the first thread wants to acquire the lock. As we may see, the `queued_spin_lock` function starts with the call of the `atomic_cmpxchg_acquire` macro. As you may guess from the name of this macro, it executes the atomic [CMPXCHG](http://x86.renejeschke.de/html/file_module_x86_id_41.html) instruction, which compares the value of the second parameter (zero in our case) with the value of the first parameter (the current state of the given spinlock), and if they are identical, it stores the value of `_Q_LOCKED_VAL` in the memory location pointed to by `&lock->val` and returns the initial value from this memory location.

The `atomic_cmpxchg_acquire` macro is defined in the [include/linux/atomic.h](https://github.com/torvalds/linux/blob/master/include/linux/atomic.h) header file and expands to the call of the `atomic_cmpxchg` function:

```C
#define  atomic_cmpxchg_acquire atomic_cmpxchg
```

which is architecture specific. We consider the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, so in our case the header file is [arch/x86/include/asm/atomic.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/atomic.h), and the implementation of the `atomic_cmpxchg` function just returns the result of the `cmpxchg` macro:

```C
static __always_inline int atomic_cmpxchg(atomic_t *v, int old, int new)
{
	return cmpxchg(&v->counter, old, new);
}
```

This macro is defined in the [arch/x86/include/asm/cmpxchg.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/cmpxchg.h) header file and looks like:

```C
#define cmpxchg(ptr, old, new) \
	__cmpxchg(ptr, old, new, sizeof(*(ptr)))

#define __cmpxchg(ptr, old, new, size) \
	__raw_cmpxchg((ptr), (old), (new), (size), LOCK_PREFIX)
```

As we may see, the `cmpxchg` macro expands to the `__cmpxchg` macro with almost the same set of parameters. The new additional parameter is the size of the atomic value. The `__cmpxchg` macro adds the `LOCK_PREFIX` and expands to the `__raw_cmpxchg` macro, where `LOCK_PREFIX` is just the [LOCK](http://x86.renejeschke.de/html/file_module_x86_id_159.html) instruction. After all this, the `__raw_cmpxchg` macro does all the job for us:

```C
#define __raw_cmpxchg(ptr, old, new, size, lock)	\
({							\
	...
	...
	...
	volatile u32 *__ptr = (volatile u32 *)(ptr);	\
	asm volatile(lock "cmpxchgl %2,%1"		\
		     : "=a" (__ret), "+m" (*__ptr)	\
		     : "r" (__new), "" (__old)		\
		     : "memory");			\
	...
	...
	...
})
```

After the `atomic_cmpxchg_acquire` macro is executed, it returns the previous value of the memory location. Since for now only one thread has tried to acquire the lock, `val` will be zero and we will return from the `queued_spin_lock` function:

```C
val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
if (likely(val == 0))
	return;
```

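For intuition, this fast path maps directly onto the C11 `atomic_compare_exchange_strong` operation; a hedged user-space equivalent might look like this (`_Q_LOCKED_VAL` is reproduced here only for the sketch):

```C
#include <stdatomic.h>
#include <stdbool.h>

#define _Q_LOCKED_VAL 1u

/* try to move the lock word from 0 (free) to _Q_LOCKED_VAL atomically;
 * returns true if the lock was acquired on the fast path */
static bool try_fast_lock(atomic_uint *val)
{
	unsigned int expected = 0;

	return atomic_compare_exchange_strong(val, &expected, _Q_LOCKED_VAL);
}
```
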
From this moment, our first thread holds the lock. Notice that this behavior differs from the behavior described in the `MCS` algorithm: the thread acquired the lock, but we didn't add it to the `queue`. As I already wrote, the implementation of the `queued spinlocks` concept in the Linux kernel is based on the `MCS` algorithm, but at the same time it has some differences like this one, for optimization purposes.

So the first thread has acquired the lock, and now let's consider what happens when a second thread tries to acquire the same lock. The second thread will start from the same `queued_spin_lock` function, but `lock->val` will contain `1`, or `_Q_LOCKED_VAL`, because the first thread already holds the lock. So, in this case the `queued_spin_lock_slowpath` function will be called. The `queued_spin_lock_slowpath` function is defined in the [kernel/locking/qspinlock.c](https://github.com/torvalds/linux/blob/master/kernel/locking/qspinlock.c) source code file and starts with the following checks:

```C
void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
	if (pv_enabled())
		goto queue;

	if (virt_spin_lock(lock))
		return;

	...
	...
	...
}
```

which check the state of the `pvqspinlock`. The `pvqspinlock` is a `queued spinlock` in a [paravirtualized](https://en.wikipedia.org/wiki/Paravirtualization) environment. As this chapter is related only to synchronization primitives in the Linux kernel, we skip these and other parts which are not directly related to the topic of this chapter. After these checks, we compare our value which represents the lock with the value of the `_Q_PENDING_VAL` macro and do nothing while this is true:

```C
if (val == _Q_PENDING_VAL) {
	while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
		cpu_relax();
}
```

where `cpu_relax` is just [NOP](https://en.wikipedia.org/wiki/NOP) instruction. Above, we saw that the lock contains - `pending` bit. This bit represents thread which wanted to acquire lock, but it is already acquired by the other thread and in the same time `queue` is empty. In this case, the `pending` bit will be set and the `queue` will not be touched. This is done for optimization, because there are no need in unnecessary latency which will be caused by the cache invalidation in a touching of own `mcs_spinlock` array.
|
||||
|
||||
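For illustration, this busy-wait could be sketched in user-space C like this (a toy analogue, not the kernel's code; `__builtin_ia32_pause` emits the x86 `PAUSE` instruction and plays the role of `cpu_relax`):

```C
#include <stdatomic.h>

/* Toy analogue of the "wait while the lock is in the pending-only
 * state" loop: re-read the value and hint the CPU that we are
 * busy-waiting until the value changes. */
static unsigned int wait_while_equal(atomic_uint *val, unsigned int pending)
{
        unsigned int v;

        while ((v = atomic_load(val)) == pending)
                __builtin_ia32_pause();   /* x86 PAUSE: spin-wait hint */

        return v;
}
```
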
At the next step we enter the following loop:

```C
for (;;) {
        if (val & ~_Q_LOCKED_MASK)
                goto queue;

        new = _Q_LOCKED_VAL;
        if (val == new)
                new |= _Q_PENDING_VAL;

        old = atomic_cmpxchg_acquire(&lock->val, val, new);
        if (old == val)
                break;

        val = old;
}
```

The first `if` clause here checks whether the state of the lock (`val`) is in the locked or pending state. This would mean that a first thread has already acquired the lock and a second thread tried to acquire it too and is now in the pending state. In that case we need to start building the queue; we will consider this situation a little later. In our case the first thread holds the lock and the second thread tries to acquire it. After this check we create a new lock value in the locked state and compare it with the previous state of the lock. As you remember, `val` contains the state of `&lock->val`, which, after the second thread calls the `atomic_cmpxchg_acquire` macro, will be equal to `1`. Both the `new` and `val` values are equal, so we set the pending bit in the lock for the second thread. After this we need to check the value of `&lock->val` again, because the first thread may have released the lock before this moment. If the first thread has not released the lock yet, the value of `old` will be equal to the value of `val` (because `atomic_cmpxchg_acquire` returns the value from the memory location pointed to by `lock->val`, which is now `1`) and we will exit from the loop. After exiting the loop, we wait until the first thread releases the lock, then clear the pending bit, acquire the lock and return:

```C
smp_cond_acquire(!(atomic_read(&lock->val) & _Q_LOCKED_MASK));
clear_pending_set_locked(lock);
return;
```

Notice that we have not touched the `queue` yet. We do not need it, because for two threads it would just lead to unnecessary latency for memory access. In the other case, the first thread may have released its lock before this moment. In that case the `lock->val` will contain `_Q_LOCKED_VAL | _Q_PENDING_VAL` and we will start to build the `queue`. We start building the `queue` by obtaining the local copy of the `mcs_nodes` array of the processor which executes the thread:

```C
node = this_cpu_ptr(&mcs_nodes[0]);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
```

Additionally we calculate `tail`, which will indicate the tail of the `queue`, and `idx`, which represents an entry of the `mcs_nodes` array. After this we set `node` to point to the correct entry of the `mcs_nodes` array, set `locked` to zero because this thread has not acquired the lock yet, and `next` to `NULL` because we do not know anything about other `queue` entries:

```C
node += idx;
node->locked = 0;
node->next = NULL;
```

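As a side note, the idea behind `encode_tail` may be sketched as follows: pack the processor number and the per-cpu node index into the upper bits of the lock value, so that a single non-zero integer identifies the tail of the queue. The bit offsets below are illustrative, not the exact kernel values:

```C
/* Illustrative sketch of tail encoding: cpu is stored +1 so that a
 * zero tail means "the queue is empty". */
#define TAIL_IDX_OFFSET 0
#define TAIL_IDX_BITS   2
#define TAIL_CPU_OFFSET (TAIL_IDX_OFFSET + TAIL_IDX_BITS)

static unsigned int encode_tail_sketch(unsigned int cpu, unsigned int idx)
{
        unsigned int tail;

        tail  = (cpu + 1) << TAIL_CPU_OFFSET;
        tail |= idx << TAIL_IDX_OFFSET;

        return tail;
}
```
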
We have already touched the `per-cpu` copy of the queue for the processor which executes the current thread which wants to acquire the lock. This means that the owner of the lock may have released it before this moment. So we may try to acquire the lock again by calling the `queued_spin_trylock` function:

```C
if (queued_spin_trylock(lock))
        goto release;
```

The `queued_spin_trylock` function is defined in the [include/asm-generic/qspinlock.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/qspinlock.h) header file and just does the same thing that the beginning of the `queued_spin_lock` function does:

```C
static __always_inline int queued_spin_trylock(struct qspinlock *lock)
{
        if (!atomic_read(&lock->val) &&
           (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0))
                return 1;
        return 0;
}
```

If the lock was successfully acquired, we jump to the `release` label to release a node of the `queue`:

```C
release:
        this_cpu_dec(mcs_nodes[0].count);
```

because we do not need it anymore, as the lock is acquired. If the `queued_spin_trylock` call was unsuccessful, we update the tail of the queue:

```C
old = xchg_tail(lock, tail);
```

and retrieve the previous tail. The next step is to check whether the `queue` is empty. If it is not, we need to link the previous entry with the new one:

```C
if (old & _Q_TAIL_MASK) {
        prev = decode_tail(old);
        WRITE_ONCE(prev->next, node);

        arch_mcs_spin_lock_contended(&node->locked);
}
```

After the queue entries are linked, we start to wait until we reach the head of the queue. Once we have reached it, we need to check for a new node which might have been added during this wait:

```C
next = READ_ONCE(node->next);
if (next)
        prefetchw(next);
```

If a new node was added, we prefetch the cache line of the memory pointed to by the next queue entry with the [PREFETCHW](http://www.felixcloutier.com/x86/PREFETCHW.html) instruction. We preload this pointer now for optimization purposes: we have just become the head of the queue, which means that an `MCS` unlock operation is upcoming and the next entry will be touched.

Yes, from this moment we are at the head of the `queue`. But before we are able to acquire the lock, we need to wait for at least two events: the current owner of the lock releasing it, and the second thread with the `pending` bit set acquiring the lock too:

```C
smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));
```

After both threads have released the lock, the head of the `queue` will hold the lock. In the end we just need to update the tail of the `queue` and remove the current head from it.

That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the second part of the [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) chapter in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) we met the first synchronization primitive provided by the Linux kernel - the `spinlock` - implemented as a `ticket spinlock`. In this part we saw another implementation of the `spinlock` mechanism - the `queued spinlock`. In the next part we will continue to dive into synchronization primitives in the Linux kernel.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [spinlock](https://en.wikipedia.org/wiki/Spinlock)
* [interrupt](https://en.wikipedia.org/wiki/Interrupt)
* [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [Test and Set](https://en.wikipedia.org/wiki/Test-and-set)
* [MCS](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf)
* [per-cpu variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [atomic instruction](https://en.wikipedia.org/wiki/Linearizability)
* [CMPXCHG instruction](http://x86.renejeschke.de/html/file_module_x86_id_41.html)
* [LOCK instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html)
* [NOP instruction](https://en.wikipedia.org/wiki/NOP)
* [PREFETCHW instruction](http://www.felixcloutier.com/x86/PREFETCHW.html)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html)

Synchronization primitives in the Linux kernel. Part 3.
================================================================================

Semaphores
--------------------------------------------------------------------------------

This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html) which describes synchronization primitives in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-2.html) we saw a special type of [spinlock](https://en.wikipedia.org/wiki/Spinlock) - the `queued spinlock`. That was the last part which describes `spinlock`-related stuff, so we need to move ahead.

The next [synchronization primitive](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) after the `spinlock` which we will see in this part is the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29). We will start from the theoretical side and learn what a `semaphore` is, and only after this we will see how it is implemented in the Linux kernel, as we did in the previous part.

So, let's start.

Introduction to the semaphores in the Linux kernel
--------------------------------------------------------------------------------

So, what is a `semaphore`? As you may guess, a `semaphore` is yet another mechanism for the support of thread or process synchronization. The Linux kernel already provides an implementation of one synchronization mechanism - `spinlocks` - so why do we need another one? To answer this question we need to know the details of both of these mechanisms. We are already familiar with `spinlocks`, so let's start with this mechanism.

The main idea behind the `spinlock` concept is a lock which will be held for a very short time. We can't sleep while a lock is held by a process or thread, because other processes are waiting for us. A [context switch](https://en.wikipedia.org/wiki/Context_switch) is not allowed because [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) is disabled to avoid [deadlocks](https://en.wikipedia.org/wiki/Deadlock).

In this way, a [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) is a good solution for locks which may be held for a long time. On the other hand, this mechanism is not optimal for locks that are held for a short time. To understand this, we need to know what a `semaphore` is.

As with the usual synchronization primitives, a `semaphore` is based on a variable. This variable may be incremented or decremented, and its state represents the ability to acquire the lock. Notice that the value of the variable is not limited to `0` and `1`. There are two types of `semaphores`:

* `binary semaphore`;
* `normal semaphore`.

In the first case, the value of the `semaphore` may be only `1` or `0`. In the second case the value of the `semaphore` may be any non-negative number. If the value of a `semaphore` is greater than `1`, it is called a `counting semaphore` and it allows more than one process to acquire the lock. This allows us to keep account of available resources, while a `spinlock` allows only one task to hold the lock. Besides all of this, one more important thing is that a `semaphore` allows sleeping. Moreover, when a process waits for a lock which is held by another process, the [scheduler](https://en.wikipedia.org/wiki/Scheduling_%28computing%29) may switch to another process.

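To make the counter semantics concrete, here is a toy user-space sketch in plain C11. It busy-waits instead of sleeping, so it only illustrates the idea above, not the kernel's implementation:

```C
#include <stdatomic.h>

typedef struct {
        atomic_int count;       /* number of available resources */
} toy_sem;

/* down: wait until at least one resource is left, then take it. */
static void toy_down(toy_sem *s)
{
        for (;;) {
                int c = atomic_load(&s->count);

                if (c > 0 && atomic_compare_exchange_weak(&s->count, &c, c - 1))
                        return;
        }
}

/* up: return one resource so that a waiter may proceed. */
static void toy_up(toy_sem *s)
{
        atomic_fetch_add(&s->count, 1);
}
```
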
Semaphore API
--------------------------------------------------------------------------------

So, we know a little about `semaphores` from the theoretical side; let's look at their implementation in the Linux kernel. The whole `semaphore` [API](https://en.wikipedia.org/wiki/Application_programming_interface) is located in the [include/linux/semaphore.h](https://github.com/torvalds/linux/blob/master/include/linux/semaphore.h) header file.

We may see that the `semaphore` mechanism is represented by the following structure:

```C
struct semaphore {
        raw_spinlock_t          lock;
        unsigned int            count;
        struct list_head        wait_list;
};
```

in the Linux kernel. The `semaphore` structure consists of three fields:

* `lock` - a `spinlock` for the protection of the `semaphore` data;
* `count` - the number of available resources;
* `wait_list` - the list of processes which are waiting to acquire the lock.

Before we consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) of the `semaphore` mechanism in the Linux kernel, we need to know how to initialize a `semaphore`. Actually, the Linux kernel provides two approaches to initializing a given `semaphore` structure:

* `statically`;
* `dynamically`.

Let's look at the first approach. We are able to initialize a `semaphore` statically with the `DEFINE_SEMAPHORE` macro:

```C
#define DEFINE_SEMAPHORE(name) \
        struct semaphore name = __SEMAPHORE_INITIALIZER(name, 1)
```

As we may see, the `DEFINE_SEMAPHORE` macro provides the ability to initialize only a `binary` semaphore. The `DEFINE_SEMAPHORE` macro expands to the definition of a `semaphore` structure which is initialized with the `__SEMAPHORE_INITIALIZER` macro. Let's look at the implementation of this macro:

```C
#define __SEMAPHORE_INITIALIZER(name, n)                                \
{                                                                       \
        .lock           = __RAW_SPIN_LOCK_UNLOCKED((name).lock),        \
        .count          = n,                                            \
        .wait_list      = LIST_HEAD_INIT((name).wait_list),             \
}
```

The `__SEMAPHORE_INITIALIZER` macro takes the name of the future `semaphore` structure and initializes its fields. First of all we initialize the `spinlock` of the given `semaphore` with the `__RAW_SPIN_LOCK_UNLOCKED` macro. As you may remember from the [previous](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) parts, the `__RAW_SPIN_LOCK_UNLOCKED` macro is defined in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) header file and expands to the `__ARCH_SPIN_LOCK_UNLOCKED` macro, which just expands to zero, the unlocked state:

```C
#define __ARCH_SPIN_LOCK_UNLOCKED       { { 0 } }
```

The last two fields of the `semaphore` structure, `count` and `wait_list`, are initialized with the given value, which represents the count of available resources, and with an empty [list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html).

The second way to initialize a `semaphore` structure is to pass the `semaphore` and the number of available resources to the `sema_init` function, which is defined in the [include/linux/semaphore.h](https://github.com/torvalds/linux/blob/master/include/linux/semaphore.h) header file:

```C
static inline void sema_init(struct semaphore *sem, int val)
{
        static struct lock_class_key __key;
        *sem = (struct semaphore) __SEMAPHORE_INITIALIZER(*sem, val);
        lockdep_init_map(&sem->lock.dep_map, "semaphore->lock", &__key, 0);
}
```

Let's consider the implementation of this function. It looks pretty simple and actually does almost the same thing: it initializes the given `semaphore` with the `__SEMAPHORE_INITIALIZER` macro which we just saw. As I already wrote in the previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html), we will skip the stuff which is related to the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) of the Linux kernel.

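For example, a counting semaphore guarding three identical resources might be set up and used like this (a hypothetical kernel-module sketch; the `slots` name and the functions are made up for illustration):

```C
#include <linux/semaphore.h>

static struct semaphore slots;

static void slots_setup(void)
{
        sema_init(&slots, 3);   /* up to three tasks may hold a slot */
}

static void use_slot(void)
{
        down(&slots);           /* blocks while all three slots are taken */
        /* ... work with one of the three resources ... */
        up(&slots);             /* give the slot back */
}
```
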
So, now that we are able to initialize a `semaphore`, let's look at how to lock and unlock one. The Linux kernel provides the following [API](https://en.wikipedia.org/wiki/Application_programming_interface) to manipulate `semaphores`:

```
void down(struct semaphore *sem);
void up(struct semaphore *sem);
int  down_interruptible(struct semaphore *sem);
int  down_killable(struct semaphore *sem);
int  down_trylock(struct semaphore *sem);
int  down_timeout(struct semaphore *sem, long jiffies);
```

The first two functions, `down` and `up`, are for acquiring and releasing the given `semaphore`. The `down_interruptible` function tries to acquire the `semaphore`. If the attempt is successful, the `count` field of the given `semaphore` will be decremented and the lock acquired; otherwise the task will be switched to the blocked state, or in other words the `TASK_INTERRUPTIBLE` flag will be set. The `TASK_INTERRUPTIBLE` flag means that the process may be returned to the running state by a [signal](https://en.wikipedia.org/wiki/Unix_signal).

The `down_killable` function does the same as the `down_interruptible` function, but sets the `TASK_KILLABLE` flag for the current process. This means that the waiting process may be interrupted by a kill signal.

The `down_trylock` function is similar to the `spin_trylock` function. It tries to acquire the lock and exits if the operation was unsuccessful. In this case, the process which wants to acquire the lock will not wait. The last function, `down_timeout`, tries to acquire the lock, but it will be interrupted in the waiting state when the given timeout expires. Additionally, you may notice that the timeout is in [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html).

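Putting it together, a typical usage pattern might look like this (a hypothetical sketch; the semaphore here is binary, so it acts as a simple sleeping lock):

```C
#include <linux/errno.h>
#include <linux/semaphore.h>

static DEFINE_SEMAPHORE(my_sem);        /* binary semaphore, count = 1 */

static int do_protected_work(void)
{
        if (down_interruptible(&my_sem))
                return -EINTR;          /* interrupted by a signal */

        /* ... critical section: only one task is here at a time ... */

        up(&my_sem);
        return 0;
}
```
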
We just saw the definitions of the `semaphore` [API](https://en.wikipedia.org/wiki/Application_programming_interface). We will start with the `down` function. This function is defined in the [kernel/locking/semaphore.c](https://github.com/torvalds/linux/blob/master/kernel/locking/semaphore.c) source code file. Let's look at its implementation:

```C
void down(struct semaphore *sem)
{
        unsigned long flags;

        raw_spin_lock_irqsave(&sem->lock, flags);
        if (likely(sem->count > 0))
                sem->count--;
        else
                __down(sem);
        raw_spin_unlock_irqrestore(&sem->lock, flags);
}
EXPORT_SYMBOL(down);
```

We may see the definition of the `flags` variable at the beginning of the `down` function. This variable is passed to the `raw_spin_lock_irqsave` and `raw_spin_unlock_irqrestore` macros, which are defined in the [include/linux/spinlock.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock.h) header file and protect the counter of the given `semaphore` here. Actually both of these macros do the same thing as the `spin_lock` and `spin_unlock` macros, but additionally they disable [interrupts](https://en.wikipedia.org/wiki/Interrupt) and save/restore the current value of the interrupt flags.

As you may already have guessed, the main work is done between the `raw_spin_lock_irqsave` and `raw_spin_unlock_irqrestore` macros in the `down` function. We compare the value of the `semaphore` counter with zero and, if it is bigger than zero, we may decrement this counter. This means that we have acquired the lock. Otherwise the counter is zero, which means that all available resources are exhausted and we need to wait to acquire this lock. As we may see, the `__down` function will be called in this case.

The `__down` function is defined in the [same](https://github.com/torvalds/linux/blob/master/kernel/locking/semaphore.c) source code file and its implementation looks like this:

```C
static noinline void __sched __down(struct semaphore *sem)
{
        __down_common(sem, TASK_UNINTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}
```

The `__down` function just calls the `__down_common` function with three parameters:

* `semaphore`;
* `flag` - for the task;
* `timeout` - the maximum timeout to wait for the `semaphore`.

Before we consider the implementation of the `__down_common` function, notice that the implementations of the `down_interruptible`, `down_killable` and `down_timeout` functions are based on the `__down_common` function too. The `__down_interruptible`:

```C
static noinline int __sched __down_interruptible(struct semaphore *sem)
{
        return __down_common(sem, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}
```

The `__down_killable`:

```C
static noinline int __sched __down_killable(struct semaphore *sem)
{
        return __down_common(sem, TASK_KILLABLE, MAX_SCHEDULE_TIMEOUT);
}
```

And the `__down_timeout`:

```C
static noinline int __sched __down_timeout(struct semaphore *sem, long timeout)
{
        return __down_common(sem, TASK_UNINTERRUPTIBLE, timeout);
}
```

Now let's look at the implementation of the `__down_common` function. This function is defined in the [kernel/locking/semaphore.c](https://github.com/torvalds/linux/blob/master/kernel/locking/semaphore.c) source code file too and starts with the definition of the two following local variables:

```C
struct task_struct *task = current;
struct semaphore_waiter waiter;
```

The first represents the current task for the local processor which wants to acquire the lock. The `current` is a macro which is defined in the [arch/x86/include/asm/current.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/current.h) header file:

```C
#define current get_current()
```

where the `get_current` function returns the value of the `current_task` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable:

```C
DECLARE_PER_CPU(struct task_struct *, current_task);

static __always_inline struct task_struct *get_current(void)
{
        return this_cpu_read_stable(current_task);
}
```

The second variable, `waiter`, represents an entry of the `semaphore.wait_list` list:

```C
struct semaphore_waiter {
        struct list_head        list;
        struct task_struct      *task;
        bool                    up;
};
```

Next, we add the current task to the `wait_list` and fill the `waiter` fields after the definition of these variables:

```C
list_add_tail(&waiter.list, &sem->wait_list);
waiter.task = task;
waiter.up = false;
```

In the next step we enter the following infinite loop:

```C
for (;;) {
        if (signal_pending_state(state, task))
                goto interrupted;

        if (unlikely(timeout <= 0))
                goto timed_out;

        __set_task_state(task, state);

        raw_spin_unlock_irq(&sem->lock);
        timeout = schedule_timeout(timeout);
        raw_spin_lock_irq(&sem->lock);

        if (waiter.up)
                return 0;
}
```

In the previous piece of code we set `waiter.up` to `false`. So, a task will spin in this loop until `up` is set to `true`. This loop starts with a check of whether the current task is in the `pending` state, or in other words whether the flags of this task contain the `TASK_INTERRUPTIBLE` or `TASK_WAKEKILL` flag. As I already wrote above, a task may be interrupted by a [signal](https://en.wikipedia.org/wiki/Unix_signal) while waiting for the ability to acquire a lock. The `signal_pending_state` function is defined in the [include/linux/sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h) source code file and looks like this:

```C
static inline int signal_pending_state(long state, struct task_struct *p)
{
        if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
                return 0;
        if (!signal_pending(p))
                return 0;

        return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
}
```

First we check that the `state` [bitmask](https://en.wikipedia.org/wiki/Mask_%28computing%29) contains the `TASK_INTERRUPTIBLE` or `TASK_WAKEKILL` bits; if it does not, we exit. At the next step we check whether the given task has a pending signal, and exit if there is none. In the end we just check the `TASK_INTERRUPTIBLE` bit in the `state` bitmask again, or check for a fatal signal like [SIGKILL](https://en.wikipedia.org/wiki/Unix_signal#SIGKILL). So, if our task has a pending signal, we will jump to the `interrupted` label:

```C
interrupted:
        list_del(&waiter.list);
        return -EINTR;
```

where we delete the task from the list of lock waiters and return the `-EINTR` [error code](https://en.wikipedia.org/wiki/Errno.h). If a task has no pending signal, we check the given timeout, and if it is less than or equal to zero:

```C
if (unlikely(timeout <= 0))
        goto timed_out;
```

we jump to the `timed_out` label:

```C
timed_out:
        list_del(&waiter.list);
        return -ETIME;
```

where we do almost the same thing as in the `interrupted` label: we delete the task from the list of lock waiters, but return the `-ETIME` error code. If a task has no pending signal and the given timeout has not expired yet, the given `state` will be set for the current task:

```C
__set_task_state(task, state);
```

and the `schedule_timeout` function will be called:

```C
raw_spin_unlock_irq(&sem->lock);
timeout = schedule_timeout(timeout);
raw_spin_lock_irq(&sem->lock);
```

The `schedule_timeout` function is defined in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file and makes the current task sleep until the given timeout expires.

That is all about the `__down_common` function. A task which wants to acquire a lock which is already held by another task will spin in the infinite loop until it is interrupted by a signal, the given timeout expires, or the task which holds the lock releases it. Now let's look at the implementation of the `up` function.

The `up` function is defined in the [same](https://github.com/torvalds/linux/blob/master/kernel/locking/semaphore.c) source code file as the `down` function. As we already know, the main purpose of this function is to release a lock. It looks like this:

```C
void up(struct semaphore *sem)
{
        unsigned long flags;

        raw_spin_lock_irqsave(&sem->lock, flags);
        if (likely(list_empty(&sem->wait_list)))
                sem->count++;
        else
                __up(sem);
        raw_spin_unlock_irqrestore(&sem->lock, flags);
}
EXPORT_SYMBOL(up);
```

It looks almost the same as the `down` function. There are only two differences here. First of all, we increment the counter of the `semaphore` if the list of waiters is empty. Otherwise we call the `__up` function from the same source code file. If the list of waiters is not empty, we need to allow the first task from the list to acquire the lock:

```C
static noinline void __sched __up(struct semaphore *sem)
{
        struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
                                                struct semaphore_waiter, list);
        list_del(&waiter->list);
        waiter->up = true;
        wake_up_process(waiter->task);
}
```

Here we take the first task from the list of waiters, delete it from the list and set its `waiter->up` field to `true`. From this point the infinite loop in the `__down_common` function will stop. The `wake_up_process` function is called at the end of the `__up` function. As you remember, we called the `schedule_timeout` function in the infinite loop of `__down_common`; it makes the current task sleep until the given timeout expires. So, as our process may be sleeping right now, we need to wake it up. That's why we call the `wake_up_process` function from the [kernel/sched/core.c](https://github.com/torvalds/linux/blob/master/kernel/sched/core.c) source code file.

That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the third part of the [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) chapter in the Linux kernel. In the two previous parts we met the first synchronization primitive provided by the Linux kernel - the `spinlock` - which is implemented as a `ticket spinlock` and used for locks held for a very short time. In this part we saw yet another synchronization primitive - the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) - which is used for locks held for a longer time, as waiting on it leads to a [context switch](https://en.wikipedia.org/wiki/Context_switch). In the next part we will continue to dive into synchronization primitives in the Linux kernel and will see the next synchronization primitive - the [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion).

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [spinlocks](https://en.wikipedia.org/wiki/Spinlock)
* [synchronization primitive](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29)
* [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29)
* [context switch](https://en.wikipedia.org/wiki/Context_switch)
* [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29)
* [deadlocks](https://en.wikipedia.org/wiki/Deadlock)
* [scheduler](https://en.wikipedia.org/wiki/Scheduling_%28computing%29)
* [Doubly linked list in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html)
* [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html)
* [interrupts](https://en.wikipedia.org/wiki/Interrupt)
* [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [bitmask](https://en.wikipedia.org/wiki/Mask_%28computing%29)
* [SIGKILL](https://en.wikipedia.org/wiki/Unix_signal#SIGKILL)
* [errno](https://en.wikipedia.org/wiki/Errno.h)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-2.html)

Synchronization primitives in the Linux kernel. Part 4.
================================================================================

Introduction
--------------------------------------------------------------------------------

This is the fourth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html) which describes synchronization primitives in the Linux kernel. In the previous parts we finished considering different types of [spinlock](https://en.wikipedia.org/wiki/Spinlock) and [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) synchronization primitives. We will continue to learn about [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) in this part and consider yet another one, which is called [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) and stands for `MUTual EXclusion`.

As in all previous parts of this [book](https://0xax.gitbooks.io/linux-insides/content), we will try to consider this synchronization primitive from the theoretical side and only then consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) provided by the Linux kernel to manipulate `mutexes`.

So, let's start.

Concept of `mutex`
--------------------------------------------------------------------------------

We are already familiar with the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) synchronization primitive from the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html). It is represented by the:

```C
struct semaphore {
        raw_spinlock_t          lock;
        unsigned int            count;
        struct list_head        wait_list;
};
```

structure, which holds information about the state of a [lock](https://en.wikipedia.org/wiki/Lock_%28computer_science%29) and a list of the lock's waiters. Depending on the value of the `count` field, a `semaphore` can provide access to a resource for more than one process wishing to access it. The [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion) concept is very similar to the [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) concept, but it has some differences. The main difference between the `semaphore` and `mutex` synchronization primitives is that a `mutex` has stricter semantics: unlike a `semaphore`, only one [process](https://en.wikipedia.org/wiki/Process_%28computing%29) may hold a `mutex` at a time, and only the `owner` of a `mutex` may release or unlock it. An additional difference is in the implementation of the `lock` [API](https://en.wikipedia.org/wiki/Application_programming_interface): the `semaphore` synchronization primitive forces rescheduling of the processes which are in its waiters list, while the implementation of the `mutex` lock `API` allows avoiding this situation and, as a result, avoiding expensive [context switches](https://en.wikipedia.org/wiki/Context_switch).

The `mutex` synchronization primitive is represented by the following:

```C
struct mutex {
        atomic_t                count;
        spinlock_t              wait_lock;
        struct list_head        wait_list;
#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_MUTEX_SPIN_ON_OWNER)
        struct task_struct      *owner;
#endif
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
        struct optimistic_spin_queue osq;
#endif
#ifdef CONFIG_DEBUG_MUTEXES
        void                    *magic;
#endif
#ifdef CONFIG_DEBUG_LOCK_ALLOC
        struct lockdep_map      dep_map;
#endif
};
```

structure in the Linux kernel. This structure is defined in the [include/linux/mutex.h](https://github.com/torvalds/linux/blob/master/include/linux/mutex.h) header file and contains a set of fields similar to the `semaphore` structure. The first field of the `mutex` structure is `count`. The value of this field represents the state of the `mutex`. When the value of the `count` field is `1`, the `mutex` is in the `unlocked` state. When the value of the `count` field is `zero`, the `mutex` is in the `locked` state. Additionally, the value of the `count` field may be `negative`. In this case the `mutex` is in the `locked` state and has possible waiters.

The next two fields of the `mutex` structure, `wait_lock` and `wait_list`, are a [spinlock](https://en.wikipedia.org/wiki/Spinlock) for the protection of the `wait queue` and the list of waiters which represents this `wait queue` for the certain lock. As you may notice, here the similarity of the `mutex` and `semaphore` structures ends. The remaining fields of the `mutex` structure, as we may see, depend on different configuration options of the Linux kernel.

The first of them, `owner`, represents the [process](https://en.wikipedia.org/wiki/Process_%28computing%29) which acquired the lock. As we may see, the existence of this field in the `mutex` structure depends on the `CONFIG_DEBUG_MUTEXES` or `CONFIG_MUTEX_SPIN_ON_OWNER` kernel configuration options. The main point of this field and the next field, `osq`, is support of `optimistic spinning`, which we will see later. The last two fields, `magic` and `dep_map`, are used only in [debugging](https://en.wikipedia.org/wiki/Debugging) mode. The `magic` field stores `mutex`-related information for debugging and the `dep_map` field is for the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) of the Linux kernel.

Now, after we have considered the `mutex` structure, we may consider how this synchronization primitive works in the Linux kernel. As you may guess, a process which wants to acquire the lock must decrease the value of `mutex->count` if possible, and if a process wants to release the lock, it must increase the same value. That's true. But as you may also guess, it is not so simple in the Linux kernel.

Actually, when a process tries to acquire a `mutex`, there are three possible paths:

* `fastpath`;
* `midpath`;
* `slowpath`.

which may be taken, depending on the current state of the `mutex`. The first path, `fastpath`, is the fastest, as you may understand from its name. Everything is easy in this case: nobody has acquired the `mutex`, so the value of the `count` field of the `mutex` structure may be directly decremented. In the case of unlocking the `mutex`, the algorithm is the same: the process just increments the value of the `count` field of the `mutex` structure. Of course, all of these operations must be [atomic](https://en.wikipedia.org/wiki/Linearizability).

Yes, this looks pretty easy. But what happens if a process wants to acquire a `mutex` which is already acquired by another process? In this case, control will be transferred to the second path, `midpath`. The `midpath`, or `optimistic spinning`, tries to [spin](https://en.wikipedia.org/wiki/Spinlock) with the already familiar [MCS lock](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf) while the lock owner is running. This path will be executed only if there are no other processes ready to run that have higher priority. It is called `optimistic` because the waiting task will not sleep and be rescheduled, which allows avoiding an expensive [context switch](https://en.wikipedia.org/wiki/Context_switch).

In the last case, when the `fastpath` and `midpath` cannot be taken, the last path, `slowpath`, will be executed. This path acts like a [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29) lock. If the lock cannot be acquired by a process, this process will be added to the `wait queue`, which is represented by the following:

```C
struct mutex_waiter {
        struct list_head        list;
        struct task_struct      *task;
#ifdef CONFIG_DEBUG_MUTEXES
        void                    *magic;
#endif
};
```

structure from the [include/linux/mutex.h](https://github.com/torvalds/linux/blob/master/include/linux/mutex.h) header file, and the process will sleep. Before we consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) which is provided by the Linux kernel for the manipulation of `mutexes`, let's consider the `mutex_waiter` structure. If you have read the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html) of this chapter, you may notice that the `mutex_waiter` structure is similar to the `semaphore_waiter` structure from the [kernel/locking/semaphore.c](https://github.com/torvalds/linux/blob/master/kernel/locking/semaphore.c) source code file:

```C
struct semaphore_waiter {
        struct list_head        list;
        struct task_struct      *task;
        bool                    up;
};
```

It also contains the `list` and `task` fields, which represent an entry of the mutex wait queue. The one difference here is that the `mutex_waiter` structure does not contain the `up` field, but contains the `magic` field, which depends on the `CONFIG_DEBUG_MUTEXES` kernel configuration option and is used to store `mutex`-related information for debugging purposes.

Now we know what a `mutex` is and how it is represented in the Linux kernel. In this case, we may go ahead and start to look at the [API](https://en.wikipedia.org/wiki/Application_programming_interface) which the Linux kernel provides for the manipulation of `mutexes`.

Mutex API
--------------------------------------------------------------------------------

Ok, in the previous paragraph we learned what the `mutex` synchronization primitive is and saw the `mutex` structure which represents a `mutex` in the Linux kernel. Now it's time to consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) for the manipulation of mutexes. Description of the `mutex` API is located in the [include/linux/mutex.h](https://github.com/torvalds/linux/blob/master/include/linux/mutex.h) header file. As always, before we consider how to acquire and release a `mutex`, we need to know how to initialize it.

There are two approaches to initializing a `mutex`. The first is to do it statically. For this purpose the Linux kernel provides the following:

```C
#define DEFINE_MUTEX(mutexname) \
        struct mutex mutexname = __MUTEX_INITIALIZER(mutexname)
```

macro. Let's consider the implementation of this macro. As we may see, the `DEFINE_MUTEX` macro takes a name for the `mutex` and expands to the definition of a new `mutex` structure. Additionally, the new `mutex` structure is initialized with the `__MUTEX_INITIALIZER` macro. Let's look at the implementation of the `__MUTEX_INITIALIZER`:

```C
#define __MUTEX_INITIALIZER(lockname)                                   \
{                                                                       \
        .count = ATOMIC_INIT(1),                                        \
        .wait_lock = __SPIN_LOCK_UNLOCKED(lockname.wait_lock),          \
        .wait_list = LIST_HEAD_INIT(lockname.wait_list)                 \
}
```

This macro is defined in the [same](https://github.com/torvalds/linux/blob/master/include/linux/mutex.h) header file and, as we may understand, it initializes the fields of the `mutex` structure with their initial values. The `count` field is initialized with `1`, which represents the `unlocked` state of a mutex. The `wait_lock` [spinlock](https://en.wikipedia.org/wiki/Spinlock) is initialized to the unlocked state, and the last field, `wait_list`, to an empty [doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html).

The second approach allows us to initialize a `mutex` dynamically. To do this we need to call the `__mutex_init` function from the [kernel/locking/mutex.c](https://github.com/torvalds/linux/blob/master/kernel/locking/mutex.c) source code file. Actually, the `__mutex_init` function is rarely called directly. Instead of `__mutex_init`, the:

```C
# define mutex_init(mutex)                                              \
do {                                                                    \
        static struct lock_class_key __key;                             \
                                                                        \
        __mutex_init((mutex), #mutex, &__key);                          \
} while (0)
```

macro is used. We may see that the `mutex_init` macro just defines the `lock_class_key` and calls the `__mutex_init` function. Let's look at the implementation of this function:

```C
void
__mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
{
        atomic_set(&lock->count, 1);
        spin_lock_init(&lock->wait_lock);
        INIT_LIST_HEAD(&lock->wait_list);
        mutex_clear_owner(lock);
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
        osq_lock_init(&lock->osq);
#endif
        debug_mutex_init(lock, name, key);
}
```

As we may see, the `__mutex_init` function takes three arguments:

* `lock` - the mutex itself;
* `name` - the name of the mutex for debugging purposes;
* `key` - the key for the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt).

At the beginning of the `__mutex_init` function we may see the initialization of the `mutex` state. We set it to the `unlocked` state with the `atomic_set` function, which atomically sets the given variable to the given value. After this we may see the initialization of the `spinlock` which will protect the `wait queue` of the `mutex`, and the initialization of the `wait queue` itself. After this we clear the owner of the `lock` and initialize the optimistic queue by calling the `osq_lock_init` function from the [include/linux/osq_lock.h](https://github.com/torvalds/linux/blob/master/include/linux/osq_lock.h) header file. This function just sets the tail of the optimistic queue to the unlocked state:

```C
static inline void osq_lock_init(struct optimistic_spin_queue *lock)
{
        atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
}
```

At the end of the `__mutex_init` function we may see the call of the `debug_mutex_init` function, but as I already wrote in previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/index.html), we will not consider debugging-related stuff in this chapter.

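So a mutex embedded in a dynamically allocated object would be initialized roughly like this (a hypothetical sketch; the `my_device` structure and function are made up for illustration):

```C
#include <linux/mutex.h>
#include <linux/slab.h>

struct my_device {
        struct mutex lock;
        int state;
};

static struct my_device *my_device_alloc(void)
{
        struct my_device *dev = kzalloc(sizeof(*dev), GFP_KERNEL);

        if (dev)
                mutex_init(&dev->lock); /* must happen before first use */

        return dev;
}
```
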
After the `mutex` structure is initialized, we may go ahead and look at the `lock` and `unlock` API of the `mutex` synchronization primitive. Implementations of the `mutex_lock` and `mutex_unlock` functions are located in the [kernel/locking/mutex.c](https://github.com/torvalds/linux/blob/master/kernel/locking/mutex.c) source code file. First of all, let's start with the implementation of `mutex_lock`. It looks like this:

```C
void __sched mutex_lock(struct mutex *lock)
{
        might_sleep();
        __mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
        mutex_set_owner(lock);
}
```

We may see the call of the `might_sleep` macro from the [include/linux/kernel.h](https://github.com/torvalds/linux/blob/master/include/linux/kernel.h) header file at the beginning of the `mutex_lock` function. The implementation of this macro depends on the `CONFIG_DEBUG_ATOMIC_SLEEP` kernel configuration option: if this option is enabled, the macro prints a stack trace if it is executed in [atomic](https://en.wikipedia.org/wiki/Linearizability) context. This macro is a helper for debugging purposes. Otherwise it does nothing.

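So `might_sleep` may be placed at the top of any function which may sleep, to document that fact and to catch callers in atomic context. A hypothetical sketch:

```C
#include <linux/kernel.h>

static void my_sleeping_helper(void)
{
        might_sleep();  /* warn (with CONFIG_DEBUG_ATOMIC_SLEEP) when
                         * called from atomic context */

        /* ... code which may block, e.g. taking a mutex ... */
}
```
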
After the `might_sleep` macro, we may see the call of the `__mutex_fastpath_lock` function. This function is architecture-specific and, as we consider the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture in this book, the implementation of `__mutex_fastpath_lock` is located in the [arch/x86/include/asm/mutex_64.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/mutex_64.h) header file. As we may understand from its name, this function will try to acquire the lock via the fast path, or in other words it will try to decrement the value of the `count` field of the given mutex.

The implementation of the `__mutex_fastpath_lock` function consists of two parts. The first part is an [inline assembly](https://0xax.gitbooks.io/linux-insides/content/Theory/asm.html) statement. Let's look at it:

```C
asm_volatile_goto(LOCK_PREFIX "   decl %0\n"
                  "   jns %l[exit]\n"
                  : : "m" (v->counter)
                  : "memory", "cc"
                  : exit);
```

First of all, let's pay attention to `asm_volatile_goto`. This macro is defined in the [include/linux/compiler-gcc.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler-gcc.h) header file and just expands to two inline assembly statements:

```C
#define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
```

The first assembly statement contains the `goto` specifier and the second, empty inline assembly statement is a [barrier](https://en.wikipedia.org/wiki/Memory_barrier). Now let's return to our inline assembly statement. As we may see, it starts with the `LOCK_PREFIX` macro, which just expands to the [lock](http://x86.renejeschke.de/html/file_module_x86_id_159.html) instruction:

```C
#define LOCK_PREFIX LOCK_PREFIX_HERE "\n\tlock; "
```

As we already know from the previous parts, this instruction allows the prefixed instruction to be executed [atomically](https://en.wikipedia.org/wiki/Linearizability). So, as the first step in our assembly statement, we try to decrement the value of the given `mutex->counter`. At the next step the [jns](http://unixwiz.net/techtips/x86-jumps.html) instruction will jump to the `exit` label if the value of the decremented `mutex->counter` is not negative. The `exit` label is the second part of the `__mutex_fastpath_lock` function and just points to the exit from this function:

```C
exit:
        return;
```

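Conceptually, this `lock decl` plus `jns` pair is an atomic decrement followed by a sign check. A user-space C11 sketch of the same idea (an illustration, not the kernel's code):

```C
#include <stdatomic.h>

/* Returns non-zero when the fastpath succeeds: the old value was 1
 * (unlocked), so after the decrement the counter is 0 (locked by us).
 * A negative counter after the decrement means contention. */
static int fastpath_lock_sketch(atomic_int *count)
{
        return atomic_fetch_sub(count, 1) - 1 >= 0;   /* mirrors jns */
}
```
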
For now, the implementation of the `__mutex_fastpath_lock` function looks pretty easy. But the value of the `mutex->counter` may be negative after the decrement. In this case the:

```C
fail_fn(v);
```

will be called after our inline assembly statement. The `fail_fn` is the second parameter of the `__mutex_fastpath_lock` function and represents a pointer to the function which implements the `midpath/slowpath` ways of acquiring the given lock. In our case the `fail_fn` is the `__mutex_lock_slowpath` function. Before we look at the implementation of the `__mutex_lock_slowpath` function, let's finish with the implementation of the `mutex_lock` function. In the simplest case, the lock will be acquired successfully by a process and the `__mutex_fastpath_lock` function will finish. In this case, we just call

```C
mutex_set_owner(lock);
```

at the end of `mutex_lock`. The `mutex_set_owner` function is defined in the [kernel/locking/mutex.h](https://github.com/torvalds/linux/blob/master/kernel/locking/mutex.h) header file and just sets the owner of the lock to the current process:

```C
static inline void mutex_set_owner(struct mutex *lock)
{
        lock->owner = current;
}
```

Now let's consider the situation when a process which wants to acquire the lock is unable to do so, because another process has already acquired the same lock. We already know that the `__mutex_lock_slowpath` function will be called in this case. Let's consider the implementation of this function. It is defined in the [kernel/locking/mutex.c](https://github.com/torvalds/linux/blob/master/kernel/locking/mutex.c) source code file and starts by obtaining the proper mutex from the mutex state given by `__mutex_fastpath_lock` with the `container_of` macro:

```C
__visible void __sched
__mutex_lock_slowpath(atomic_t *lock_count)
{
        struct mutex *lock = container_of(lock_count, struct mutex, count);

        __mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0,
                            NULL, _RET_IP_, NULL, 0);
}
```

and calls the `__mutex_lock_common` function with the obtained `mutex`. The `__mutex_lock_common` function starts by disabling [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) until rescheduling:

```C
preempt_disable();
```

After this comes the stage of optimistic spinning. As we already know, this stage depends on the `CONFIG_MUTEX_SPIN_ON_OWNER` kernel configuration option. If this option is disabled, we skip this stage and move to the last path, the `slowpath`, of `mutex` acquisition:

```C
if (mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx)) {
        preempt_enable();
        return 0;
}
```

First of all, the `mutex_optimistic_spin` function checks that we do not need to reschedule, or in other words that there are no other tasks ready to run that have higher priority. If this check is successful, we need to update the `MCS` lock wait queue with the current spinner. This way only one spinner can compete for the mutex at a time:

```C
osq_lock(&lock->osq)
```

At the next step we start to spin in the following loop:

```C
while (true) {
        owner = READ_ONCE(lock->owner);

        if (owner && !mutex_spin_on_owner(lock, owner))
                break;

        if (mutex_try_to_acquire(lock)) {
                lock_acquired(&lock->dep_map, ip);

                mutex_set_owner(lock);
                osq_unlock(&lock->osq);
                return true;
        }
}
```

and try to acquire the lock. First of all we read the current owner. If the owner exists (it may not exist in the case when a process has already released the mutex), we wait for it in the `mutex_spin_on_owner` function until the owner releases the lock. If a new task with higher priority appears during the wait for the lock owner, we break the loop and go to sleep. Otherwise, the owner may have already released the lock, so we try to acquire it with the `mutex_try_to_acquire` function. If this operation finishes successfully, we set the new owner for the given mutex, remove ourselves from the `MCS` wait queue and exit from the `mutex_optimistic_spin` function. At this point the lock is acquired by the process, so we enable [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) and exit from the `__mutex_lock_common` function:

```C
if (mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx)) {
        preempt_enable();
        return 0;
}
```

That's all for this case.

In other cases things are not so successful. For example, a new task may appear while we spin in the loop in `mutex_optimistic_spin`, or we may not even get to this loop in the case when there were task(s) with higher priority before it. Or, finally, the `CONFIG_MUTEX_SPIN_ON_OWNER` kernel configuration option may be disabled. In that case `mutex_optimistic_spin` does nothing:

```C
#ifndef CONFIG_MUTEX_SPIN_ON_OWNER
static bool mutex_optimistic_spin(struct mutex *lock,
                                  struct ww_acquire_ctx *ww_ctx,
                                  const bool use_ww_ctx)
{
        return false;
}
#endif
```

In all of these cases, the `__mutex_lock_common` function will act like a `semaphore`. We try to acquire the lock again, because the owner of the lock might have released it before this time:

```C
if (!mutex_is_locked(lock) &&
    (atomic_xchg_acquire(&lock->count, 0) == 1))
        goto skip_wait;
```

In the failure case the process which wants to acquire the lock will be added to the waiters list:

```C
list_add_tail(&waiter.list, &lock->wait_list);
waiter.task = task;
```

In the successful case we update the owner of the lock, enable preemption and exit from the `__mutex_lock_common` function:

```C
skip_wait:
        mutex_set_owner(lock);
        preempt_enable();
        return 0;
```

In this case the lock will be acquired. If we can't acquire the lock for now, we enter the following loop:

```C
for (;;) {

        if (atomic_read(&lock->count) >= 0 && (atomic_xchg_acquire(&lock->count, -1) == 1))
                break;

        if (unlikely(signal_pending_state(state, task))) {
                ret = -EINTR;
                goto err;
        }

        __set_task_state(task, state);

        schedule_preempt_disabled();
}
```

where try to acquire a lock again and exit if this operation was successful. Yes, we try to acquire a lock again right after unsuccessful try before the loop. We need to do it to make sure that we get a wakeup once a lock will be unlocked. Besides this, it allows us to acquire a lock after sleep. In other case we check the current process for pending [signals](https://en.wikipedia.org/wiki/Unix_signal) and exit if the process was interrupted by a `signal` during wait for a lock acquisition. In the end of loop we didn't acquire a lock, so we set the task state for `TASK_UNINTERRUPTIBLE` and go to sleep with call of the `schedule_preempt_disabled` function.
That's all. We have considered all three possible paths through which a process may pass when it wants to acquire a lock. Now let's consider how `mutex_unlock` is implemented. When `mutex_unlock` is called by a process which wants to release a lock, the `__mutex_fastpath_unlock` will be called from the [arch/x86/include/asm/mutex_64.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/mutex_64.h) header file:

```C
void __sched mutex_unlock(struct mutex *lock)
{
    __mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
}
```
Implementation of the `__mutex_fastpath_unlock` function is very similar to the implementation of the `__mutex_fastpath_lock` function:

```C
static inline void __mutex_fastpath_unlock(atomic_t *v,
                                           void (*fail_fn)(atomic_t *))
{
    asm_volatile_goto(LOCK_PREFIX "   incl %0\n"
                      "   jg %l[exit]\n"
                      : : "m" (v->counter)
                      : "memory", "cc"
                      : exit);
    fail_fn(v);
exit:
    return;
}
```
Actually, there is only one difference: we increment the value of `mutex->count`, so it will represent the `unlocked` state after this operation. If the `mutex` was released but we have something in the `wait queue`, we need to update it. In this case the `fail_fn` function will be called, which is `__mutex_unlock_slowpath`. The `__mutex_unlock_slowpath` function just gets the correct `mutex` instance from the given `mutex->count` and calls the `__mutex_unlock_common_slowpath` function:

```C
__mutex_unlock_slowpath(atomic_t *lock_count)
{
    struct mutex *lock = container_of(lock_count, struct mutex, count);

    __mutex_unlock_common_slowpath(lock, 1);
}
```
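The `container_of` macro used here computes the address of the enclosing structure from the address of one of its members. Conceptually, it does the following (a simplified sketch; the real kernel macro additionally performs type checking):

```C
#include <stddef.h>

/* simplified sketch of the kernel's container_of macro: subtract the
 * offset of the member within the structure from the member's address */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
```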
In the `__mutex_unlock_common_slowpath` function we take the first entry from the wait queue, if the wait queue is not empty, and wake up the related process:

```C
    if (!list_empty(&lock->wait_list)) {
        struct mutex_waiter *waiter =
            list_entry(lock->wait_list.next, struct mutex_waiter, list);
        wake_up_process(waiter->task);
    }
```
After this, the mutex is released by the previous process and may be acquired by another process from the wait queue.

That's all. We have considered the main `API` for the manipulation of `mutexes`: `mutex_lock` and `mutex_unlock`. Besides this, the Linux kernel provides the following API:

* `mutex_lock_interruptible`;
* `mutex_lock_killable`;
* `mutex_trylock`.

and the corresponding `unlock`-prefixed versions of these functions. This part will not describe this `API`, because it is similar to the corresponding `API` of `semaphores`; you may read more about it in the [previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html).
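As a quick illustration (a hypothetical kernel-side snippet, not taken from the kernel source tree), the basic `mutex` API is used like this:

```C
#include <linux/mutex.h>

/* hypothetical data protected by a statically defined mutex */
static DEFINE_MUTEX(my_mutex);
static unsigned long my_counter;

static void my_counter_inc(void)
{
    mutex_lock(&my_mutex);      /* may sleep, so not for atomic context */
    my_counter++;
    mutex_unlock(&my_mutex);
}
```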
That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the fourth part of the [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) chapter in the Linux kernel. In this part we met a new synchronization primitive which is called `mutex`. From the theoretical side, this synchronization primitive is very similar to a [semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29); actually, a `mutex` represents a binary semaphore. But its implementation differs from the implementation of `semaphore` in the Linux kernel. In the next part we will continue to dive into synchronization primitives in the Linux kernel.

If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [Mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
* [Spinlock](https://en.wikipedia.org/wiki/Spinlock)
* [Semaphore](https://en.wikipedia.org/wiki/Semaphore_%28programming%29)
* [Synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [Locking mechanism](https://en.wikipedia.org/wiki/Lock_%28computer_science%29)
* [Context switches](https://en.wikipedia.org/wiki/Context_switch)
* [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
* [Atomic](https://en.wikipedia.org/wiki/Linearizability)
* [MCS lock](http://www.cs.rochester.edu/~scott/papers/1991_TOCS_synch.pdf)
* [Doubly linked list](https://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [Inline assembly](https://0xax.gitbooks.io/linux-insides/content/Theory/asm.html)
* [Memory barrier](https://en.wikipedia.org/wiki/Memory_barrier)
* [Lock instruction](http://x86.renejeschke.de/html/file_module_x86_id_159.html)
* [JNS instruction](http://unixwiz.net/techtips/x86-jumps.html)
* [Preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29)
* [Unix signals](https://en.wikipedia.org/wiki/Unix_signal)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-3.html)

Synchronization primitives in the Linux kernel. Part 6.
================================================================================

Introduction
--------------------------------------------------------------------------------

This is the sixth part of the chapter which describes [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_(computer_science)) in the Linux kernel. In the previous parts we finished considering different [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) synchronization primitives. In this part we will continue to learn synchronization primitives and consider a similar one which can be used to avoid the `writer starvation` problem. The name of this synchronization primitive is `seqlock` or `sequential locks`.

We know from the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-5.html) that a [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) is a special lock mechanism which allows concurrent access for read-only operations, but requires an exclusive lock for writing or modifying data. As we may guess, this may lead to a problem which is called `writer starvation`. In other words, a writer process can't acquire the lock as long as at least one reader process holds it. So, when contention is high, a writer process which wants to acquire the lock may have to wait for it for a long time.

The `seqlock` synchronization primitive can help solve this problem.

As in all previous parts of this [book](https://0xax.gitbooks.io/linux-insides/content), we will try to consider this synchronization primitive from the theoretical side, and only then consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) provided by the Linux kernel to manipulate `seqlocks`.

So, let's start.
Sequential lock
--------------------------------------------------------------------------------

So, what is a `seqlock` synchronization primitive and how does it work? Let's try to answer these questions in this paragraph. Actually, `sequential locks` were introduced in the Linux kernel 2.6.x. The main point of this synchronization primitive is to provide fast and lock-free access to shared resources. Since the heart of the `sequential lock` synchronization primitive is the [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) synchronization primitive, `sequential locks` work in situations where the protected resources are small and simple. Additionally, write access must be rare and also should be fast.

Work of this synchronization primitive is based on a counter of a sequence of events. Actually, a `sequential lock` allows free access to a resource for readers, but each reader must check for the existence of conflicts with a writer. This synchronization primitive introduces a special counter. The main algorithm of `sequential locks` is simple: each writer which acquires a sequential lock increments this counter and additionally acquires a [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html). When this writer finishes, it releases the acquired spinlock to give access to other writers and increments the counter of the sequential lock again.

Read-only access works on the following principle: a reader gets the value of a `sequential lock` counter before it enters the [critical section](https://en.wikipedia.org/wiki/Critical_section) and compares it with the value of the same `sequential lock` counter at the exit of the critical section. If their values are equal, this means that there were no writers during this period. If their values are not equal, this means that a writer incremented the counter during the [critical section](https://en.wikipedia.org/wiki/Critical_section). This conflict means that the reading of the protected data must be repeated.

That's all. As we may see, the principle of work of `sequential locks` is simple; the reader side might be expressed with the following pseudocode:

```C
unsigned int seq_counter_value;

do {
    seq_counter_value = get_seq_counter_val(&the_lock);
    //
    // do as we want here
    //
} while (__retry__);
```

Actually the Linux kernel does not provide a `get_seq_counter_val()` function; here it is just a stub, like `__retry__` too. As I already wrote above, we will see the actual [API](https://en.wikipedia.org/wiki/Application_programming_interface) for this in the next paragraph of this part.

Ok, now we know what a `seqlock` synchronization primitive is and how it is represented in the Linux kernel. In this case, we may go ahead and start to look at the [API](https://en.wikipedia.org/wiki/Application_programming_interface) which the Linux kernel provides for the manipulation of synchronization primitives of this type.
Sequential lock API
--------------------------------------------------------------------------------

So, now we know a little about the `sequential lock` synchronization primitive from the theoretical side; let's look at its implementation in the Linux kernel. The whole `sequential locks` [API](https://en.wikipedia.org/wiki/Application_programming_interface) is located in the [include/linux/seqlock.h](https://github.com/torvalds/linux/blob/master/include/linux/seqlock.h) header file.

First of all we may see that the `sequential lock` mechanism is represented by the following type:

```C
typedef struct {
    struct seqcount seqcount;
    spinlock_t lock;
} seqlock_t;
```

As we may see, the `seqlock_t` provides two fields. These fields represent a sequential lock counter, the description of which we saw above, and also a [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) which will protect data from other writers. Note that the `seqcount` counter is represented by the `seqcount` type, which is the following structure:

```C
typedef struct seqcount {
    unsigned sequence;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
    struct lockdep_map dep_map;
#endif
} seqcount_t;
```

which holds the counter of a sequential lock and a [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related field.
As always in previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/), before we consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) of the `sequential lock` mechanism in the Linux kernel, we need to know how to initialize an instance of `seqlock_t`.

We saw in the previous parts that the Linux kernel often provides two approaches to execute initialization of a given synchronization primitive. The same situation holds for the `seqlock_t` structure, which may be initialized in one of the two following ways:

* `statically`;
* `dynamically`.

Let's look at the first approach. We are able to initialize a `seqlock_t` statically with the `DEFINE_SEQLOCK` macro:

```C
#define DEFINE_SEQLOCK(x) \
        seqlock_t x = __SEQLOCK_UNLOCKED(x)
```

which is defined in the [include/linux/seqlock.h](https://github.com/torvalds/linux/blob/master/include/linux/seqlock.h) header file. As we may see, the `DEFINE_SEQLOCK` macro takes one argument and expands to the definition and initialization of the `seqlock_t` structure. Initialization occurs with the help of the `__SEQLOCK_UNLOCKED` macro which is defined in the same source code file. Let's look at the implementation of this macro:

```C
#define __SEQLOCK_UNLOCKED(lockname)                    \
        {                                               \
                .seqcount = SEQCNT_ZERO(lockname),      \
                .lock = __SPIN_LOCK_UNLOCKED(lockname)  \
        }
```

As we may see, the `__SEQLOCK_UNLOCKED` macro executes initialization of the fields of the given `seqlock_t` structure. The first field, `seqcount`, is initialized with the `SEQCNT_ZERO` macro which expands to:

```C
#define SEQCNT_ZERO(lockname) { .sequence = 0, SEQCOUNT_DEP_MAP_INIT(lockname)}
```

So we just initialize the counter of the given sequential lock to zero, and additionally we can see [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related initialization which depends on the state of the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option:

```C
#ifdef CONFIG_DEBUG_LOCK_ALLOC
# define SEQCOUNT_DEP_MAP_INIT(lockname) \
                .dep_map = { .name = #lockname } \
...
...
...
#else
# define SEQCOUNT_DEP_MAP_INIT(lockname)
...
...
...
#endif
```

As I already wrote in previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/), we will not consider [debugging](https://en.wikipedia.org/wiki/Debugging) and [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related stuff in this part, so for now we just skip the `SEQCOUNT_DEP_MAP_INIT` macro. The second field of the given `seqlock_t`, `lock`, is initialized with the `__SPIN_LOCK_UNLOCKED` macro which is defined in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) header file. We will not consider the implementation of this macro here, as it just initializes a [raw spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) with architecture-specific methods (you may read more about spinlocks in the first parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/)).
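As a quick illustration (a hypothetical snippet, not from the kernel source), a statically initialized sequential lock protecting some data might be declared like this:

```C
#include <linux/seqlock.h>

/* hypothetical data protected by a statically initialized seqlock */
static DEFINE_SEQLOCK(my_seqlock);
static unsigned long my_data;
```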
We have considered the first way to initialize a sequential lock. Let's consider the second way to do the same, but dynamically. We can initialize a sequential lock with the `seqlock_init` macro which is defined in the same [include/linux/seqlock.h](https://github.com/torvalds/linux/blob/master/include/linux/seqlock.h) header file.

Let's look at the implementation of this macro:

```C
#define seqlock_init(x)                                 \
        do {                                            \
                seqcount_init(&(x)->seqcount);          \
                spin_lock_init(&(x)->lock);             \
        } while (0)
```

As we may see, the `seqlock_init` macro expands into two macros. The first macro, `seqcount_init`, takes the counter of the given sequential lock and expands to the call of the `__seqcount_init` function:

```C
# define seqcount_init(s)                               \
        do {                                            \
                static struct lock_class_key __key;     \
                __seqcount_init((s), #s, &__key);       \
        } while (0)
```

from the same header file. This function

```C
static inline void __seqcount_init(seqcount_t *s, const char *name,
                                   struct lock_class_key *key)
{
        lockdep_init_map(&s->dep_map, name, key, 0);
        s->sequence = 0;
}
```

just initializes the counter of the given `seqcount_t` with zero. The second call from the `seqlock_init` macro is the call of the `spin_lock_init` macro which we saw in the [first part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) of this chapter.
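An embedded `seqlock_t` would therefore be initialized dynamically like this (again a hypothetical snippet):

```C
#include <linux/seqlock.h>

struct my_struct {
    seqlock_t lock;
    unsigned long value;
};

static void my_struct_init(struct my_struct *s)
{
    seqlock_init(&s->lock);   /* dynamic counterpart of DEFINE_SEQLOCK */
    s->value = 0;
}
```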
So, now we know how to initialize a `sequential lock`; let's look at how to use it. The Linux kernel provides the following [API](https://en.wikipedia.org/wiki/Application_programming_interface) to manipulate `sequential locks`:

```C
static inline unsigned read_seqbegin(const seqlock_t *sl);
static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start);
static inline void write_seqlock(seqlock_t *sl);
static inline void write_sequnlock(seqlock_t *sl);
static inline void write_seqlock_irq(seqlock_t *sl);
static inline void write_sequnlock_irq(seqlock_t *sl);
static inline void read_seqlock_excl(seqlock_t *sl);
static inline void read_sequnlock_excl(seqlock_t *sl);
```

and others. Before we move on to considering the implementation of this [API](https://en.wikipedia.org/wiki/Application_programming_interface), we must know that there are actually two types of readers. The first type of reader never blocks a writer process; in this case a writer will not wait for readers. The second type of reader can take the lock; in this case, the locking reader blocks the writer, as the writer has to wait until the reader releases its lock.

First of all let's consider the first type of readers. The `read_seqbegin` function begins a seq-read [critical section](https://en.wikipedia.org/wiki/Critical_section).

As we may see, this function just returns the value of the `read_seqcount_begin` function:

```C
static inline unsigned read_seqbegin(const seqlock_t *sl)
{
    return read_seqcount_begin(&sl->seqcount);
}
```

In its turn, the `read_seqcount_begin` function calls the `raw_read_seqcount_begin` function:

```C
static inline unsigned read_seqcount_begin(const seqcount_t *s)
{
    return raw_read_seqcount_begin(s);
}
```

which just returns the value of the `sequential lock` counter:

```C
static inline unsigned raw_read_seqcount(const seqcount_t *s)
{
    unsigned ret = READ_ONCE(s->sequence);
    smp_rmb();
    return ret;
}
```
After we have the initial value of the given `sequential lock` counter and have done some work, we know from the previous paragraph of this part that we need to compare it with the current value of the counter of the same `sequential lock` before we exit the critical section. We can achieve this by calling the `read_seqretry` function. This function takes a `sequential lock` and the start value of the counter and, through a chain of functions:

```C
static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start)
{
    return read_seqcount_retry(&sl->seqcount, start);
}

static inline int read_seqcount_retry(const seqcount_t *s, unsigned start)
{
    smp_rmb();
    return __read_seqcount_retry(s, start);
}
```

it calls the `__read_seqcount_retry` function:

```C
static inline int __read_seqcount_retry(const seqcount_t *s, unsigned start)
{
    return unlikely(s->sequence != start);
}
```

which just compares the value of the counter of the given `sequential lock` with the initial value of this counter. Additionally, if the initial value of the counter obtained from the `read_seqbegin()` function is odd, this means that a writer was in the middle of updating the data when our reader began to act (the counter is incremented once when a writer enters its critical section and once more when it leaves, so it is odd exactly while a write is in progress). In this case the value of the data can be in an inconsistent state, so we need to try to read it again.
This is a common pattern in the Linux kernel. For example, you may remember the `jiffies` concept from the [first part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) of the [timers and time management in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/) chapter. A sequential lock is used to obtain the value of `jiffies` on the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture:

```C
u64 get_jiffies_64(void)
{
    unsigned long seq;
    u64 ret;

    do {
        seq = read_seqbegin(&jiffies_lock);
        ret = jiffies_64;
    } while (read_seqretry(&jiffies_lock, seq));
    return ret;
}
```

Here we just read the value of the counter of the `jiffies_lock` sequential lock and then write the value of the `jiffies_64` system variable to `ret`. As we see a `do/while` loop here, the body of the loop is executed at least once. After the body of the loop has been executed, we read and compare the current value of the counter of the `jiffies_lock` with its initial value. If these values are not equal, execution of the loop is repeated; otherwise `get_jiffies_64` returns its value in `ret`.
We just saw the first type of readers, which do not block writers or other readers. Let's consider the second type. It does not update the value of a `sequential lock` counter; it just takes the `spinlock`:

```C
static inline void read_seqlock_excl(seqlock_t *sl)
{
    spin_lock(&sl->lock);
}
```

So no other reader or writer can access the protected data. When the reader finishes, the lock must be unlocked with the:

```C
static inline void read_sequnlock_excl(seqlock_t *sl)
{
    spin_unlock(&sl->lock);
}
```

function.
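Using the hypothetical `my_seqlock`/`my_data` pair declared earlier, a locking reader might look like this:

```C
/* hypothetical locking reader: blocks writers while it reads */
static unsigned long my_data_read_excl(void)
{
    unsigned long val;

    read_seqlock_excl(&my_seqlock);
    val = my_data;
    read_sequnlock_excl(&my_seqlock);

    return val;
}
```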
Now we know how a `sequential lock` works for readers. Let's consider how a writer acts when it wants to acquire a `sequential lock` to modify data. To acquire a `sequential lock`, a writer should use the `write_seqlock` function. If we look at the implementation of this function:

```C
static inline void write_seqlock(seqlock_t *sl)
{
    spin_lock(&sl->lock);
    write_seqcount_begin(&sl->seqcount);
}
```

we will see that it acquires the `spinlock` to prevent access from other writers and calls the `write_seqcount_begin` function. This function just increments the value of the `sequential lock` counter:

```C
static inline void raw_write_seqcount_begin(seqcount_t *s)
{
    s->sequence++;
    smp_wmb();
}
```

When a writer process finishes modifying data, the `write_sequnlock` function must be called to release the lock and give access to other writers or readers. Let's consider the implementation of the `write_sequnlock` function. It looks pretty simple:

```C
static inline void write_sequnlock(seqlock_t *sl)
{
    write_seqcount_end(&sl->seqcount);
    spin_unlock(&sl->lock);
}
```

First of all it calls the `write_seqcount_end` function to increment the value of the counter of the `sequential` lock again:

```C
static inline void raw_write_seqcount_end(seqcount_t *s)
{
    smp_wmb();
    s->sequence++;
}
```

and in the end we just call the `spin_unlock` macro to give access to other readers or writers.
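Continuing the hypothetical `my_seqlock`/`my_data` example, a writer then looks like this:

```C
/* hypothetical writer: takes the spinlock and bumps the sequence
 * counter around the update, so lock-free readers can detect it */
static void my_data_update(unsigned long new_value)
{
    write_seqlock(&my_seqlock);
    my_data = new_value;
    write_sequnlock(&my_seqlock);
}
```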
That's all about the `sequential lock` mechanism in the Linux kernel. Of course we did not consider the full [API](https://en.wikipedia.org/wiki/Application_programming_interface) of this mechanism in this part, but all other functions are based on those which we described here. For example, the Linux kernel also provides some safe macros/functions to use the `sequential lock` mechanism in [interrupt handlers](https://en.wikipedia.org/wiki/Interrupt_handler) and [softirq](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html) context: `write_seqlock_irq` and `write_sequnlock_irq`:

```C
static inline void write_seqlock_irq(seqlock_t *sl)
{
    spin_lock_irq(&sl->lock);
    write_seqcount_begin(&sl->seqcount);
}

static inline void write_sequnlock_irq(seqlock_t *sl)
{
    write_seqcount_end(&sl->seqcount);
    spin_unlock_irq(&sl->lock);
}
```

As we may see, these functions differ only in how they take and release the spinlock: they call `spin_lock_irq` and `spin_unlock_irq` instead of `spin_lock` and `spin_unlock`.

There are also, for example, the `write_seqlock_irqsave` and `write_sequnlock_irqrestore` functions, which do the same but use the `spin_lock_irqsave` and `spin_unlock_irqrestore` macros so they can be used in [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_(PC_architecture)) handlers.

That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the sixth part of the [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) chapter in the Linux kernel. In this part we met a new synchronization primitive which is called `sequential lock`. From the theoretical side, this synchronization primitive is very similar to a [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) synchronization primitive, but allows us to avoid the `writer starvation` issue.

If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_(computer_science))
* [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock)
* [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html)
* [critical section](https://en.wikipedia.org/wiki/Critical_section)
* [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
* [debugging](https://en.wikipedia.org/wiki/Debugging)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [Timers and time management in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/)
* [interrupt handlers](https://en.wikipedia.org/wiki/Interrupt_handler)
* [softirq](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html)
* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_(PC_architecture))
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-5.html)

# System calls

This chapter describes the `system call` concept in the linux kernel.

* [Introduction to system call concept](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) - this part is an introduction to the `system call` concept in the Linux kernel.
* [How the Linux kernel handles a system call](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) - this part describes how the Linux kernel handles a system call from a userspace application.
* [vsyscall and vDSO](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html) - the third part describes the `vsyscall` and `vDSO` concepts.
* [How the Linux kernel runs a program](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html) - this part describes the startup process of a program.
System calls in the Linux kernel. Part 2.
================================================================================

How does the Linux kernel handle a system call
--------------------------------------------------------------------------------

The previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) was the first part of the chapter that describes the [system call](https://en.wikipedia.org/wiki/System_call) concepts in the Linux kernel.
In the previous part we learned what a system call is in the Linux kernel, and in operating systems in general. This was introduced from a user-space perspective, and part of the [write](http://man7.org/linux/man-pages/man2/write.2.html) system call implementation was discussed. In this part we continue our look at system calls, starting with some theory before moving onto the Linux kernel code.

A user application does not make a system call directly. We did not write our `Hello world!` program like:

```C
int main(int argc, char **argv)
{
    ...
    ...
    ...
    sys_write(fd1, buf, strlen(buf));
    ...
    ...
}
```

We can use something similar with the help of the [C standard library](https://en.wikipedia.org/wiki/GNU_C_Library) and it will look something like this:

```C
#include <unistd.h>

int main(int argc, char **argv)
{
    ...
    ...
    ...
    write(fd1, buf, strlen(buf));
    ...
    ...
}
```

But anyway, `write` is not a direct system call nor a kernel function. An application must fill the general purpose registers with the correct values in the correct order and use the `syscall` instruction to make the actual system call. In this part we will look at what occurs in the Linux kernel when the `syscall` instruction is met by the processor.
Initialization of the system calls table
--------------------------------------------------------------------------------

From the previous part we know that the system call concept is very similar to an interrupt. Furthermore, system calls are implemented as software interrupts. So, when the processor handles a `syscall` instruction from a user application, this instruction causes an exception which transfers control to an exception handler. As we know, all exception handlers (or in other words kernel [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) functions that will react on an exception) are placed in the kernel code. But how does the Linux kernel search for the address of the necessary system call handler for the related system call? The Linux kernel contains a special table called the `system call table`. The system call table is represented by the `sys_call_table` array in the Linux kernel which is defined in the [arch/x86/entry/syscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscall_64.c) source code file. Let's look at its implementation:

```C
asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
    [0 ... __NR_syscall_max] = &sys_ni_syscall,
#include <asm/syscalls_64.h>
};
```

As we can see, the `sys_call_table` is an array of `__NR_syscall_max + 1` size where the `__NR_syscall_max` macro represents the maximum number of system calls for the given [architecture](https://en.wikipedia.org/wiki/List_of_CPU_architectures). This book is about the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, so for our case the `__NR_syscall_max` is `322` and this is the correct number at the time of writing (current Linux kernel version is `4.2.0-rc8+`). We can see this macro in the header file generated by [Kbuild](https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt) during kernel compilation - `include/generated/asm-offsets.h`:

```C
#define __NR_syscall_max 322
```

There will be the same number of system calls in the [arch/x86/entry/syscalls/syscall_64.tbl](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L331) for the `x86_64`. There are two important topics here: the type of the `sys_call_table` array, and the initialization of elements in this array. First of all, the type. The `sys_call_ptr_t` represents a pointer to a system call handler. It is defined as a [typedef](https://en.wikipedia.org/wiki/Typedef) for a function pointer that returns nothing and does not take arguments:

```C
typedef void (*sys_call_ptr_t)(void);
```

The second thing is the initialization of the `sys_call_table` array. As we can see in the code above, all elements of our array that contain pointers to the system call handlers point to the `sys_ni_syscall`. The `sys_ni_syscall` function represents not-implemented system calls. To start with, all elements of the `sys_call_table` array point to the not-implemented system call. This is the correct initial behaviour, because we only initialize the storage of the pointers to the system call handlers here; it is populated later on. The implementation of the `sys_ni_syscall` is pretty easy: it just returns [-errno](http://man7.org/linux/man-pages/man3/errno.3.html), or `-ENOSYS` in our case:

```C
asmlinkage long sys_ni_syscall(void)
{
    return -ENOSYS;
}
```

The `-ENOSYS` error tells us that:

```
ENOSYS Function not implemented (POSIX.1)
```

Also a note on `...` in the initialization of the `sys_call_table`: we can do this with a [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) compiler extension called [Designated Initializers](https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html). This extension allows us to initialize elements in non-fixed order. As you can see, we include the `asm/syscalls_64.h` header at the end of the array. This header file is generated by the special script at [arch/x86/entry/syscalls/syscalltbl.sh](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscalltbl.sh) from the [syscall table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl). The `asm/syscalls_64.h` contains definitions of the following macros:

```C
__SYSCALL_COMMON(0, sys_read, sys_read)
__SYSCALL_COMMON(1, sys_write, sys_write)
__SYSCALL_COMMON(2, sys_open, sys_open)
__SYSCALL_COMMON(3, sys_close, sys_close)
__SYSCALL_COMMON(5, sys_newfstat, sys_newfstat)
...
...
...
```
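By the way, here is a minimal userspace illustration of the `[first ... last]` range form of designated initializers (a hypothetical example, not related to the kernel source):

```C
#include <stdio.h>

int main(void)
{
    /* every element defaults to -1, then element 2 is overridden,
     * just like sys_ni_syscall entries are overridden later */
    int table[10] = { [0 ... 9] = -1, [2] = 42 };

    printf("%d %d %d\n", table[0], table[2], table[9]); /* -1 42 -1 */
    return 0;
}
```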
The `__SYSCALL_COMMON` macro is defined in the same source code [file](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscall_64.c) and expands to the `__SYSCALL_64` macro, which in its turn expands to an array element initializer:

```C
#define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
#define __SYSCALL_64(nr, sym, compat) [nr] = sym,
```

So, after this, our `sys_call_table` takes the following form:

```C
asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
    [0 ... __NR_syscall_max] = &sys_ni_syscall,
    [0] = sys_read,
    [1] = sys_write,
    [2] = sys_open,
    ...
    ...
    ...
};
```

After this, all elements that point to the non-implemented system calls contain the address of the `sys_ni_syscall` function that just returns `-ENOSYS`, as we saw above, and the other elements point to the `sys_syscall_name` functions.

At this point, we have filled the system call table and the Linux kernel knows where each system call handler is. But the Linux kernel does not call a `sys_syscall_name` function immediately after it is instructed to handle a system call from a user space application. Remember the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about interrupts and interrupt handling: when the Linux kernel gets control to handle an interrupt, it has to do some preparations like saving user space registers, switching to a new stack and many more tasks before it calls an interrupt handler. The same situation holds for system call handling. The preparation for handling a system call is the first thing, but before the Linux kernel starts these preparations, the entry point of a system call must be initialized, and only the Linux kernel knows how to perform this preparation. In the next paragraph we will see the process of the initialization of the system call entry in the Linux kernel.
Initialization of the system call entry
--------------------------------------------------------------------------------

When a system call occurs in the system, where are the first bytes of code that start to handle it? As we can read in the Intel manual - [64-ia-32-architectures-software-developer-vol-2b-manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html):

```
SYSCALL invokes an OS system-call handler at privilege level 0.
It does so by loading RIP from the IA32_LSTAR MSR
```

It means that we need to put the system call entry into the `IA32_LSTAR` [model specific register](https://en.wikipedia.org/wiki/Model-specific_register). This operation takes place during the Linux kernel initialization process. If you have read the fourth [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-4.html) of the chapter that describes interrupts and interrupt handling in the Linux kernel, you know that the Linux kernel calls the `trap_init` function during the initialization process. This function is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and executes the initialization of the `non-early` exception handlers like divide error, [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) error etc. Besides the initialization of the `non-early` exception handlers, this function calls the `cpu_init` function from the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c) source code file which, besides the initialization of `per-cpu` state, calls the `syscall_init` function from the same source code file.

This function performs the initialization of the system call entry point. Let's look at the implementation of this function. It does not take parameters and first of all it fills two model specific registers:

```C
wrmsrl(MSR_STAR, ((u64)__USER32_CS)<<48 | ((u64)__KERNEL_CS)<<32);
wrmsrl(MSR_LSTAR, entry_SYSCALL_64);
```

The first model specific register - `MSR_STAR` - contains in its `63:48` bits the user code segment. These bits will be loaded to the `CS` and `SS` segment registers for the `sysret` instruction which provides functionality to return from a system call to user code with the related privilege. Also the `MSR_STAR` contains in its `47:32` bits the kernel code segment that will be used as the base selector for the `CS` and `SS` segment registers when user space applications execute a system call. In the second line of code we fill the `MSR_LSTAR` register with the `entry_SYSCALL_64` symbol that represents the system call entry. The `entry_SYSCALL_64` is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and contains code related to the preparation performed before a system call handler will be executed (I already wrote about these preparations, read above). We will not consider the `entry_SYSCALL_64` now, but will return to it later in this chapter.
After we have set the entry point for system calls, we need to set the following model specific registers:

* `MSR_CSTAR` - target `rip` for the compatibility mode callers;
* `MSR_IA32_SYSENTER_CS` - target `cs` for the `sysenter` instruction;
* `MSR_IA32_SYSENTER_ESP` - target `esp` for the `sysenter` instruction;
* `MSR_IA32_SYSENTER_EIP` - target `eip` for the `sysenter` instruction.

The values of these model specific registers depend on the `CONFIG_IA32_EMULATION` kernel configuration option. If this kernel configuration option is enabled, it allows legacy 32-bit programs to run under a 64-bit kernel. In the first case, when the `CONFIG_IA32_EMULATION` kernel configuration option is enabled, we fill these model specific registers with the entry points for the system calls in compatibility mode:

```C
wrmsrl(MSR_CSTAR, entry_SYSCALL_compat);
```

and with the kernel code segment, put zero into the stack pointer and write the address of the `entry_SYSENTER_compat` symbol to the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter):

```C
wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
```

Otherwise, if the `CONFIG_IA32_EMULATION` kernel configuration option is disabled, we write the `ignore_sysret` symbol to the `MSR_CSTAR`:

```C
wrmsrl(MSR_CSTAR, ignore_sysret);
```

which is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and just returns the `-ENOSYS` error code:

```assembly
ENTRY(ignore_sysret)
    mov $-ENOSYS, %eax
    sysret
END(ignore_sysret)
```

Now we need to fill the `MSR_IA32_SYSENTER_CS`, `MSR_IA32_SYSENTER_ESP` and `MSR_IA32_SYSENTER_EIP` model specific registers as we did in the previous code when the `CONFIG_IA32_EMULATION` kernel configuration option was enabled. In this case (when the `CONFIG_IA32_EMULATION` configuration option is not set) we fill the `MSR_IA32_SYSENTER_ESP` and the `MSR_IA32_SYSENTER_EIP` with zero and put the invalid segment of the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) into the `MSR_IA32_SYSENTER_CS` model specific register:

```C
wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
```

You can read more about the `Global Descriptor Table` in the second [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) of the chapter that describes the booting process of the Linux kernel.

At the end of the `syscall_init` function, we just mask flags in the [flags register](https://en.wikipedia.org/wiki/FLAGS_register) by writing the set of flags to the `MSR_SYSCALL_MASK` model specific register:

```C
wrmsrl(MSR_SYSCALL_MASK,
       X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
       X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
```

These flags will be cleared by the processor when the `syscall` instruction is executed. That's all; this is the end of the `syscall_init` function, and it means that the system call entry is ready to work. Now we can see what will occur when a user application executes the `syscall` instruction.
Preparation before a system call handler will be called
--------------------------------------------------------------------------------

As I already wrote, before a system call or an interrupt handler will be called by the Linux kernel we need to do some preparations. The `idtentry` macro performs the preparations required before an exception handler will be executed, the `interrupt` macro performs the preparations required before an interrupt handler will be called, and the `entry_SYSCALL_64` will do the preparations required before a system call handler will be executed.

The `entry_SYSCALL_64` is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and starts from the following macro:

```assembly
SWAPGS_UNSAFE_STACK
```

This macro is defined in the [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h) header file and expands to the `swapgs` instruction:

```C
#define SWAPGS_UNSAFE_STACK swapgs
```

which exchanges the current `GS` base register value with the value contained in the `MSR_KERNEL_GS_BASE` model specific register. In other words, from this point we have access to the kernel's per-cpu data through the `GS` base. After this we save the old stack pointer in the `rsp_scratch` [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable and set up the stack pointer to point to the top of the stack for the current processor:

```assembly
movq %rsp, PER_CPU_VAR(rsp_scratch)
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
```

In the next step we push the stack segment and the old stack pointer onto the stack:

```assembly
pushq $__USER_DS
pushq PER_CPU_VAR(rsp_scratch)
```

After this we enable interrupts, because interrupts are `off` on entry, and save the general purpose [registers](https://en.wikipedia.org/wiki/Processor_register) (besides `bp`, `bx` and `r12` to `r15`), the flags, `-ENOSYS` for the non-implemented system call and the code segment register on the stack:
```assembly
ENABLE_INTERRUPTS(CLBR_NONE)

pushq   %r11
pushq   $__USER_CS
pushq   %rcx
pushq   %rax
pushq   %rdi
pushq   %rsi
pushq   %rdx
pushq   %rcx
pushq   $-ENOSYS
pushq   %r8
pushq   %r9
pushq   %r10
pushq   %r11
sub     $(6*8), %rsp
```

When a system call occurs from a user application, the general purpose registers have the following state:

* `rax` - contains the system call number;
* `rcx` - contains the return address to the user space;
* `r11` - contains the register flags;
* `rdi` - contains the first argument of a system call handler;
* `rsi` - contains the second argument of a system call handler;
* `rdx` - contains the third argument of a system call handler;
* `r10` - contains the fourth argument of a system call handler;
* `r8` - contains the fifth argument of a system call handler;
* `r9` - contains the sixth argument of a system call handler.

The other general purpose registers (such as `rbp`, `rbx` and `r12` to `r15`) are callee-preserved in the [C ABI](http://www.x86-64.org/documentation/abi.pdf). So we push the register flags on the top of the stack, then the user code segment, the return address to the user space, the system call number, the first three arguments, a dummy error code for the non-implemented system call and the other arguments on the stack.
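To make this register convention concrete, here is a hypothetical userspace snippet (not from the kernel source) that invokes the `write` system call directly with the `syscall` instruction:

```C
/* hypothetical example: calling write(2) through the raw syscall
 * instruction on x86_64, following the register convention above */
static long raw_write(int fd, const void *buf, unsigned long count)
{
    long ret;

    asm volatile ("syscall"
                  : "=a" (ret)                 /* rax - return value    */
                  : "0" (1UL),                 /* rax - __NR_write      */
                    "D" ((unsigned long)fd),   /* rdi - first argument  */
                    "S" (buf),                 /* rsi - second argument */
                    "d" (count)                /* rdx - third argument  */
                  : "rcx", "r11", "memory");   /* clobbered by syscall  */
    return ret;
}
```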
In the next step we check the `_TIF_WORK_SYSCALL_ENTRY` in the current `thread_info`:

```assembly
testl   $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
jnz     tracesys
```

The `_TIF_WORK_SYSCALL_ENTRY` macro is defined in the [arch/x86/include/asm/thread_info.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/thread_info.h) header file and provides the set of thread information flags that are related to system call tracing:

```C
#define _TIF_WORK_SYSCALL_ENTRY \
    (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT | \
     _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT | \
     _TIF_NOHZ)
```

We will not consider debugging/tracing related stuff in this chapter, but will see it in the separate chapter that will be devoted to the debugging and tracing techniques in the Linux kernel. After the `tracesys` label, the next label is the `entry_SYSCALL_64_fastpath`. In the `entry_SYSCALL_64_fastpath` we check the `__SYSCALL_MASK` that is defined in the [arch/x86/include/asm/unistd.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/unistd.h) header file:

```C
# ifdef CONFIG_X86_X32_ABI
#  define __SYSCALL_MASK (~(__X32_SYSCALL_BIT))
# else
#  define __SYSCALL_MASK (~0)
# endif
```
where the `__X32_SYSCALL_BIT` is

```C
#define __X32_SYSCALL_BIT 0x40000000
```

As we can see, the `__SYSCALL_MASK` depends on the `CONFIG_X86_X32_ABI` kernel configuration option and represents the mask for the 32-bit [ABI](https://en.wikipedia.org/wiki/Application_binary_interface) in the 64-bit kernel.

So we check the value of the `__SYSCALL_MASK`: if the `CONFIG_X86_X32_ABI` is disabled, we compare the value of the `rax` register to the maximum syscall number (`__NR_syscall_max`); alternatively, if the `CONFIG_X86_X32_ABI` is enabled, we mask the `eax` register with the `__X32_SYSCALL_BIT` and do the same comparison:

```assembly
#if __SYSCALL_MASK == ~0
    cmpq    $__NR_syscall_max, %rax
#else
    andl    $__SYSCALL_MASK, %eax
    cmpl    $__NR_syscall_max, %eax
#endif
```

After this we check the result of the last comparison with the `ja` instruction, which jumps if the `CF` and `ZF` flags are zero:

```assembly
ja  1f
```

If we have a correct system call number, we move the fourth argument from `r10` to `rcx` to keep the [x86_64 C ABI](http://www.x86-64.org/documentation/abi.pdf) calling convention and execute the `call` instruction with the address of the system call handler:

```assembly
movq    %r10, %rcx
call    *sys_call_table(, %rax, 8)
```

Note that the `sys_call_table` is the array that we saw above in this part. As we already know, the `rax` general purpose register contains the number of a system call and each element of the `sys_call_table` is 8 bytes. So we use the `*sys_call_table(, %rax, 8)` notation to find the correct offset in the `sys_call_table` array for the given system call handler.
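The same indexing, expressed in C, would look roughly like this (a conceptual model, not kernel code):

```C
/* conceptual model of call *sys_call_table(, %rax, 8):
 * the effective address is sys_call_table + rax * 8, because each
 * function pointer occupies 8 bytes on x86_64 */
typedef void (*sys_call_ptr_t)(void);

static void dispatch_syscall(const sys_call_ptr_t *table, unsigned long nr)
{
    table[nr]();   /* the compiler scales nr by sizeof(sys_call_ptr_t) */
}
```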
That's all. We did all the required preparations and the system call handler was called for the given system call number: for example `sys_read`, `sys_write`, or another system call handler that is defined with the `SYSCALL_DEFINE[N]` macro in the Linux kernel code.

Exit from a system call
--------------------------------------------------------------------------------

After a system call handler finishes its work, we return back to the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S), right after the place where we called the system call handler:

```assembly
call    *sys_call_table(, %rax, 8)
```

The next step after we've returned from a system call handler is to put the return value of the system call handler onto the stack. We know that a system call returns the result to the user program in the general purpose `rax` register, so we move its value to the `RAX` slot on the stack after the system call handler has finished its work:

```C
movq    %rax, RAX(%rsp)
```
After this we can see the call of the `LOCKDEP_SYS_EXIT` macro from the [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h):

```assembly
LOCKDEP_SYS_EXIT
```

The implementation of this macro depends on the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option that allows us to debug locks on exit from a system call. And again, we will not consider it in this chapter, but will return to it in a separate one. At the end of the `entry_SYSCALL_64` function we restore all general purpose registers besides `rcx` and `r11`, because the `rcx` register must contain the return address to the application that called the system call and the `r11` register contains the old [flags register](https://en.wikipedia.org/wiki/FLAGS_register). After all general purpose registers are restored, we fill `rcx` with the return address, the `r11` register with the flags and `rsp` with the old stack pointer:

```assembly
RESTORE_C_REGS_EXCEPT_RCX_R11

movq    RIP(%rsp), %rcx
movq    EFLAGS(%rsp), %r11
movq    RSP(%rsp), %rsp

USERGS_SYSRET64
```

In the end we just call the `USERGS_SYSRET64` macro that expands to the call of the `swapgs` instruction, which exchanges the user `GS` and kernel `GS` again, and the `sysretq` instruction, which performs the actual exit from a system call handler:

```C
#define USERGS_SYSRET64             \
    swapgs;                         \
    sysretq;
```

Now we know what occurs when a user application calls a system call. The full path of this process is as follows:

* A user application contains code that fills the general purpose registers with the values (the system call number and the arguments of this system call);
* The processor switches from user mode to kernel mode and starts execution of the system call entry - `entry_SYSCALL_64`;
* `entry_SYSCALL_64` switches to the kernel stack and saves some general purpose registers, the old stack and code segment, the flags, etc. on the stack;
* `entry_SYSCALL_64` checks the system call number in the `rax` register, looks up the system call handler in the `sys_call_table` and calls it, if the number of the system call is correct;
* If the system call number is not correct, we jump to the exit from the system call;
* After the system call handler finishes its work, we restore the general purpose registers, the old stack, the flags and the return address, and exit from `entry_SYSCALL_64` with the `sysretq` instruction.

That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the second part about the system calls concept in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) we saw the theory behind this concept from the user application's point of view. In this part we continued to dive into the stuff which is related to the system call concept and saw what the Linux kernel does when a system call occurs.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [system call](https://en.wikipedia.org/wiki/System_call)
* [write](http://man7.org/linux/man-pages/man2/write.2.html)
* [C standard library](https://en.wikipedia.org/wiki/GNU_C_Library)
* [list of cpu architectures](https://en.wikipedia.org/wiki/List_of_CPU_architectures)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [kbuild](https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt)
* [typedef](https://en.wikipedia.org/wiki/Typedef)
* [errno](http://man7.org/linux/man-pages/man3/errno.3.html)
* [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
* [model specific register](https://en.wikipedia.org/wiki/Model-specific_register)
* [intel 2b manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html)
* [coprocessor](https://en.wikipedia.org/wiki/Coprocessor)
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
* [flags register](https://en.wikipedia.org/wiki/FLAGS_register)
* [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table)
* [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [general purpose registers](https://en.wikipedia.org/wiki/Processor_register)
* [ABI](https://en.wikipedia.org/wiki/Application_binary_interface)
* [x86_64 C ABI](http://www.x86-64.org/documentation/abi.pdf)
* [previous chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html)

@ -0,0 +1,403 @@
System calls in the Linux kernel. Part 3.
================================================================================

vsyscalls and vDSO
--------------------------------------------------------------------------------

This is the third part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) that describes system calls in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) we saw the preparations the Linux kernel makes after a system call is issued by a userspace application and the process of handling a system call. In this part we will look at two concepts that are very close to the system call concept; they are called `vsyscall` and `vdso`.

We already know what a `system call` is. It is a special routine in the Linux kernel that a userspace application uses to ask the kernel to perform privileged tasks, like reading from or writing to a file, opening a socket, etc. As you may know, invoking a system call is an expensive operation in the Linux kernel, because the processor must interrupt the currently executing task and switch context to kernel mode, subsequently jumping into userspace again after the system call handler finishes its work. These two mechanisms - `vsyscall` and `vdso` - are designed to speed up this process for certain system calls, and in this part we will try to understand how they work.

Introduction to vsyscalls
--------------------------------------------------------------------------------

The `vsyscall` or `virtual system call` is the first and oldest mechanism in the Linux kernel that is designed to accelerate execution of certain system calls. The principle of work of the `vsyscall` concept is simple. The Linux kernel maps into user space a page that contains some variables and the implementation of some system calls. We can find information about this memory space in the Linux kernel [documentation](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt) for [x86_64](https://en.wikipedia.org/wiki/X86-64):

```
ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
```

or:

```
~$ sudo cat /proc/1/maps | grep vsyscall
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
```

After this, these system calls will be executed in userspace, which means there will be no [context switching](https://en.wikipedia.org/wiki/Context_switch). The mapping of the `vsyscall` page occurs in the `map_vsyscall` function that is defined in the [arch/x86/entry/vsyscall/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_64.c) source code file. This function is called during Linux kernel initialization in the `setup_arch` function that is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file (we saw this function in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) of the Linux kernel initialization process chapter).

Note that the implementation of the `map_vsyscall` function depends on the `CONFIG_X86_VSYSCALL_EMULATION` kernel configuration option:

```C
#ifdef CONFIG_X86_VSYSCALL_EMULATION
extern void map_vsyscall(void);
#else
static inline void map_vsyscall(void) {}
#endif
```

As we can read in the help text of the `CONFIG_X86_VSYSCALL_EMULATION` configuration option: `Enable vsyscall emulation`. Why emulate `vsyscall`? Actually, the `vsyscall` ABI is a legacy [ABI](https://en.wikipedia.org/wiki/Application_binary_interface), deprecated for security reasons: virtual system calls have fixed addresses, meaning that the `vsyscall` page is at the same location every time, and the location of this page is determined in the `map_vsyscall` function. Let's look at the implementation of this function:

```C
void __init map_vsyscall(void)
{
	extern char __vsyscall_page;
	unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page);
	...
	...
	...
}
```

As we can see, at the beginning of the `map_vsyscall` function we get the physical address of the `vsyscall` page with the `__pa_symbol` macro (we already saw the implementation of this macro in the fourth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process). The `__vsyscall_page` symbol is defined in the [arch/x86/entry/vsyscall/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_emu_64.S) assembly source code file and has the following [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space):

```
ffffffff81881000 D __vsyscall_page
```

It resides in the `.data..page_aligned, aw` [section](https://en.wikipedia.org/wiki/Memory_segmentation) and contains calls of the three following system calls:

* `gettimeofday`;
* `time`;
* `getcpu`.

Or:

```assembly
__vsyscall_page:
	mov $__NR_gettimeofday, %rax
	syscall
	ret

	.balign 1024, 0xcc
	mov $__NR_time, %rax
	syscall
	ret

	.balign 1024, 0xcc
	mov $__NR_getcpu, %rax
	syscall
	ret
```

Let's go back to the implementation of the `map_vsyscall` function; we will return to the implementation of `__vsyscall_page` later. After we receive the physical address of the `__vsyscall_page`, we check the value of the `vsyscall_mode` variable and set the [fix-mapped](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) address for the `vsyscall` page with the `__set_fixmap` macro:

```C
if (vsyscall_mode != NONE)
	__set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
		     vsyscall_mode == NATIVE
		     ? PAGE_KERNEL_VSYSCALL
		     : PAGE_KERNEL_VVAR);
```

The `__set_fixmap` macro takes three arguments. The first is an index into the `fixed_addresses` [enum](https://en.wikipedia.org/wiki/Enumerated_type). In our case `VSYSCALL_PAGE` is the first element of the `fixed_addresses` enum for the `x86_64` architecture:

```C
enum fixed_addresses {
...
...
...
#ifdef CONFIG_X86_VSYSCALL_EMULATION
	VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
#endif
...
...
...
```

It is equal to `511`. The second argument is the physical address of the page that has to be mapped and the third argument is the flags of the page. Note that the flags of the `VSYSCALL_PAGE` depend on the `vsyscall_mode` variable. They will be `PAGE_KERNEL_VSYSCALL` if the `vsyscall_mode` variable is `NATIVE` and `PAGE_KERNEL_VVAR` otherwise. Both macros (`PAGE_KERNEL_VSYSCALL` and `PAGE_KERNEL_VVAR`) expand to the following flags:

```C
#define __PAGE_KERNEL_VSYSCALL		(__PAGE_KERNEL_RX | _PAGE_USER)
#define __PAGE_KERNEL_VVAR		(__PAGE_KERNEL_RO | _PAGE_USER)
```

that represent access rights to the `vsyscall` page. Both contain the `_PAGE_USER` flag, which means that the page can be accessed by a user-mode process running at a lower privilege level. The second flag depends on the value of the `vsyscall_mode` variable. The first (`__PAGE_KERNEL_VSYSCALL`) will be set if `vsyscall_mode` is `NATIVE`; this means virtual system calls will be executed as native `syscall` instructions. Otherwise, the `vsyscall` page will have `PAGE_KERNEL_VVAR` flags if the `vsyscall_mode` variable is `emulate`; in this case virtual system calls will be turned into traps and emulated reasonably. The `vsyscall_mode` variable gets its value in the `vsyscall_setup` function:

```C
static int __init vsyscall_setup(char *str)
{
	if (str) {
		if (!strcmp("emulate", str))
			vsyscall_mode = EMULATE;
		else if (!strcmp("native", str))
			vsyscall_mode = NATIVE;
		else if (!strcmp("none", str))
			vsyscall_mode = NONE;
		else
			return -EINVAL;

		return 0;
	}

	return -EINVAL;
}
```

This function will be called during early kernel parameter parsing:

```C
early_param("vsyscall", vsyscall_setup);
```

You can read more about the `early_param` macro in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the chapter that describes the initialization process of the Linux kernel.

At the end of the `map_vsyscall` function we just check that the virtual address of the `vsyscall` page is equal to the value of `VSYSCALL_ADDR` with the [BUILD_BUG_ON](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) macro:

```C
BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
	     (unsigned long)VSYSCALL_ADDR);
```

That's all: the `vsyscall` page is set up. The result of all of the above is the following: if we pass the `vsyscall=native` parameter to the kernel command line, virtual system calls will be handled as native `syscall` instructions in [arch/x86/entry/vsyscall/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_emu_64.S). [glibc](https://en.wikipedia.org/wiki/GNU_C_Library) knows the addresses of the virtual system call handlers. Note that the virtual system call handlers are aligned to `1024` (or `0x400`) bytes:
|
||||
|
||||
```assembly
|
||||
__vsyscall_page:
|
||||
mov $__NR_gettimeofday, %rax
|
||||
syscall
|
||||
ret
|
||||
|
||||
.balign 1024, 0xcc
|
||||
mov $__NR_time, %rax
|
||||
syscall
|
||||
ret
|
||||
|
||||
.balign 1024, 0xcc
|
||||
mov $__NR_getcpu, %rax
|
||||
syscall
|
||||
ret
|
||||
```
|
||||
|
||||
And the start address of the `vsyscall` page is the `ffffffffff600000` every time. So, the [glibc](https://en.wikipedia.org/wiki/GNU_C_Library) knows the addresses of the all virtual system call handlers. You can find definition of these addresses in the `glibc` source code:
|
||||
|
||||
```C
|
||||
#define VSYSCALL_ADDR_vgettimeofday 0xffffffffff600000
|
||||
#define VSYSCALL_ADDR_vtime 0xffffffffff600400
|
||||
#define VSYSCALL_ADDR_vgetcpu 0xffffffffff600800
|
||||
```
|
||||
|
||||
All virtual system call requests will fall into the `__vsyscall_page` + `VSYSCALL_ADDR_vsyscall_name` offset, put the number of a virtual system call to the `rax` general purpose [register](https://en.wikipedia.org/wiki/Processor_register) and the native for the x86_64 `syscall` instruction will be executed.
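
To make this concrete, here is a small, hypothetical userspace sketch that calls the legacy `time` handler directly through its fixed address; it only works on kernels that still map the `vsyscall` page (in `native` or `emulate` mode):

```C
#include <stdio.h>
#include <time.h>

int main(void)
{
	/* fixed address of the time() entry in the vsyscall page:
	   __vsyscall_page + 0x400 */
	time_t (*vtime)(time_t *) = (time_t (*)(time_t *))0xffffffffff600400;

	time_t t = vtime(NULL);
	printf("seconds since epoch: %ld\n", (long)t);
	return 0;
}
```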

In the second case, if we pass the `vsyscall=emulate` parameter to the kernel command line, an attempt to execute a virtual system call handler will cause a [page fault](https://en.wikipedia.org/wiki/Page_fault) exception. Remember, the `vsyscall` page has `__PAGE_KERNEL_VVAR` access rights that forbid execution. The `do_page_fault` function is the `#PF` or page fault handler. It tries to understand the reason for the last page fault, and one of the reasons can be a virtual system call while the `vsyscall` mode is `emulate`. In this case the `vsyscall` will be handled by the `emulate_vsyscall` function that is defined in the [arch/x86/entry/vsyscall/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_64.c) source code file.

The `emulate_vsyscall` function gets the number of the virtual system call, checks it, prints an error and sends a [segmentation fault](https://en.wikipedia.org/wiki/Segmentation_fault) signal if the number is bad:

```C
...
...
...
vsyscall_nr = addr_to_vsyscall_nr(address);
if (vsyscall_nr < 0) {
	warn_bad_vsyscall(KERN_WARNING, regs, "misaligned vsyscall...");
	goto sigsegv;
}
...
...
...
sigsegv:
	force_sig(SIGSEGV, current);
	return true;
```

After it has checked the number of the virtual system call, it does some other checks, like for `access_ok` violations, and executes the system call function depending on the number of the virtual system call:

```C
switch (vsyscall_nr) {
case 0:
	ret = sys_gettimeofday(
		(struct timeval __user *)regs->di,
		(struct timezone __user *)regs->si);
	break;
...
...
...
}
```

In the end we put the result of the `sys_gettimeofday` or another virtual system call handler into the `ax` general purpose register, as we do with normal system calls, restore the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter) register and add `8` bytes to the [stack pointer](https://en.wikipedia.org/wiki/Stack_register) register. This operation emulates the `ret` instruction:

```C
regs->ax = ret;

do_ret:
	regs->ip = caller;
	regs->sp += 8;
	return true;
```

That's all. Now let's look at the modern concept - `vDSO`.

Introduction to vDSO
--------------------------------------------------------------------------------

As I already wrote above, `vsyscall` is an obsolete concept that has been replaced by the `vDSO` or `virtual dynamic shared object`. The main difference between the `vsyscall` and `vDSO` mechanisms is that `vDSO` maps memory pages into each process in shared object [form](https://en.wikipedia.org/wiki/Library_%28computing%29#Shared_libraries), whereas `vsyscall` is static in memory and has the same address every time. For the `x86_64` architecture it is called `linux-vdso.so.1`. All userspace applications are linked with this shared library via `glibc`. For example:

```
~$ ldd /bin/uname
	linux-vdso.so.1 (0x00007ffe014b7000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fbfee2fe000)
	/lib64/ld-linux-x86-64.so.2 (0x00005559aab7c000)
```

Or:

```
~$ sudo cat /proc/1/maps | grep vdso
7fff39f73000-7fff39f75000 r-xp 00000000 00:00 0                          [vdso]
```

Here we can see that the [uname](https://en.wikipedia.org/wiki/Uname) utility is linked with three libraries:

* `linux-vdso.so.1`;
* `libc.so.6`;
* `ld-linux-x86-64.so.2`.

The first provides `vDSO` functionality, the second is the `C` [standard library](https://en.wikipedia.org/wiki/C_standard_library) and the third is the program interpreter (more about this you can read in the part that describes [linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)). So, the `vDSO` solves the limitations of the `vsyscall`. The implementation of the `vDSO` is similar to that of `vsyscall`.
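
Because `glibc` routes certain calls through the `vDSO`, they often complete without entering the kernel at all. A minimal sketch - which calls are vDSO-backed varies by kernel and architecture, so this assumes a typical `x86_64` setup:

```C
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
	struct timeval tv;

	/* on most x86_64 systems glibc dispatches this through
	   __vdso_gettimeofday, so strace shows no gettimeofday syscall */
	gettimeofday(&tv, NULL);
	printf("%ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);
	return 0;
}
```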

Initialization of the `vDSO` occurs in the `init_vdso` function that is defined in the [arch/x86/entry/vdso/vma.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vma.c) source code file. This function starts with the initialization of the `vDSO` images for 32-bit and 64-bit, depending on the `CONFIG_X86_X32_ABI` kernel configuration option:

```C
static int __init init_vdso(void)
{
	init_vdso_image(&vdso_image_64);

#ifdef CONFIG_X86_X32_ABI
	init_vdso_image(&vdso_image_x32);
#endif
```

Both calls initialize a `vdso_image` structure. This structure is defined in two generated source code files: [arch/x86/entry/vdso/vdso-image-64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vdso-image-64.c) and [arch/x86/entry/vdso/vdso-image-x32.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vdso-image-x32.c). These source code files are generated by the [vdso2c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vdso2c.c) program from different source code files and represent different approaches to enter a system call, like `int 0x80`, `sysenter`, etc. The full set of images depends on the kernel configuration.

For example, for the `x86_64` Linux kernel it will contain `vdso_image_64`:

```C
#ifdef CONFIG_X86_64
extern const struct vdso_image vdso_image_64;
#endif
```

But for the `x32` ABI - `vdso_image_x32`:

```C
#ifdef CONFIG_X86_X32
extern const struct vdso_image vdso_image_x32;
#endif
```

If the kernel is configured for the `x86` architecture or for `x86_64` with compatibility mode, we will have the ability to enter a system call with the `int 0x80` interrupt; if compatibility mode is enabled, a system call can also be entered with the native `syscall` instruction, and otherwise with the `sysenter` instruction:

```C
#if defined CONFIG_X86_32 || defined CONFIG_COMPAT
extern const struct vdso_image vdso_image_32_int80;
#ifdef CONFIG_COMPAT
extern const struct vdso_image vdso_image_32_syscall;
#endif
extern const struct vdso_image vdso_image_32_sysenter;
#endif
```

As we can understand from the name of the `vdso_image` structure, it represents the image of the `vDSO` for a certain mode of system call entry. This structure contains information about the size in bytes of the `vDSO` area (which is always a multiple of `PAGE_SIZE`, i.e. `4096` bytes), a pointer to the text mapping, start and end addresses of the `alternatives` (sets of instructions with better alternatives for certain types of processor), etc. For example, `vdso_image_64` looks like this:

```C
const struct vdso_image vdso_image_64 = {
	.data = raw_data,
	.size = 8192,
	.text_mapping = {
		.name = "[vdso]",
		.pages = pages,
	},
	.alt = 3145,
	.alt_len = 26,
	.sym_vvar_start = -8192,
	.sym_vvar_page = -8192,
	.sym_hpet_page = -4096,
};
```

Where `raw_data` contains the raw binary code of the 64-bit `vDSO` system calls, which occupy `2` pages:

```C
static struct page *pages[2];
```

or `8` Kilobytes.

The `init_vdso_image` function is defined in the same source code file and just initializes `vdso_image.text_mapping.pages`. First of all this function calculates the number of pages and initializes each `vdso_image.text_mapping.pages[number_of_page]` with the `virt_to_page` macro that converts the given address to a `page` structure:

```C
void __init init_vdso_image(const struct vdso_image *image)
{
	int i;
	int npages = (image->size) / PAGE_SIZE;

	for (i = 0; i < npages; i++)
		image->text_mapping.pages[i] =
			virt_to_page(image->data + i*PAGE_SIZE);
	...
	...
	...
}
```

The `init_vdso` function is passed to the `subsys_initcall` macro, which adds the given function to the `initcalls` list. All functions from this list will be called in the `do_initcalls` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file:

```C
subsys_initcall(init_vdso);
```

Ok, we just saw the initialization of the `vDSO` and the initialization of the `page` structures that are related to the memory pages that contain the `vDSO` system calls. But where are their pages mapped? Actually, they are mapped by the kernel when it loads a binary into memory. The Linux kernel calls the `arch_setup_additional_pages` function from the [arch/x86/entry/vdso/vma.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vma.c) source code file, which checks that the `vDSO` is enabled for `x86_64` and calls the `map_vdso` function:

```C
int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
{
	if (!vdso64_enabled)
		return 0;

	return map_vdso(&vdso_image_64, true);
}
```

The `map_vdso` function is defined in the same source code file and maps the pages for the `vDSO` and for the shared `vDSO` variables. That's all. The main differences between the `vsyscall` and `vDSO` concepts are that `vsyscall` has a static address of `ffffffffff600000` and implements `3` system calls, whereas the `vDSO` is loaded dynamically (see the sketch after this list) and implements four system calls:

* `__vdso_clock_gettime`;
* `__vdso_getcpu`;
* `__vdso_gettimeofday`;
* `__vdso_time`.
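
Since the `vDSO` is loaded at a different address in each process, the kernel tells a process where to find it via the auxiliary vector. A small sketch using the `getauxval` function from `glibc`:

```C
#include <stdio.h>
#include <sys/auxv.h>

int main(void)
{
	/* AT_SYSINFO_EHDR holds the base address of the vDSO mapping,
	   chosen anew for each process */
	unsigned long vdso_base = getauxval(AT_SYSINFO_EHDR);

	printf("vDSO mapped at: %#lx\n", vdso_base);
	return 0;
}
```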

That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the third part about the system calls concept in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) we discussed the implementation of the preparation from the Linux kernel side before a system call is handled, and the implementation of the exit from a system call handler. In this part we continued to dive into the stuff which is related to the system call concept and learned about two new concepts that are very similar to a system call - the `vsyscall` and the `vDSO`.

After all of these three parts, we know almost everything that is related to system calls: we know what a system call is and why user applications need them. We also know what occurs when a user application calls a system call and how the kernel handles system calls.

The next part will be the last part in this [chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) and we will see what occurs when a user runs a program.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [x86_64 memory map](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [context switching](https://en.wikipedia.org/wiki/Context_switch)
* [ABI](https://en.wikipedia.org/wiki/Application_binary_interface)
* [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space)
* [Segmentation](https://en.wikipedia.org/wiki/Memory_segmentation)
* [enum](https://en.wikipedia.org/wiki/Enumerated_type)
* [fix-mapped addresses](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)
* [glibc](https://en.wikipedia.org/wiki/GNU_C_Library)
* [BUILD_BUG_ON](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html)
* [Processor register](https://en.wikipedia.org/wiki/Processor_register)
* [Page fault](https://en.wikipedia.org/wiki/Page_fault)
* [segmentation fault](https://en.wikipedia.org/wiki/Segmentation_fault)
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
* [stack pointer](https://en.wikipedia.org/wiki/Stack_register)
* [uname](https://en.wikipedia.org/wiki/Uname)
* [Linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html)

@ -0,0 +1,430 @@
System calls in the Linux kernel. Part 4.
================================================================================

How does the Linux kernel run a program
--------------------------------------------------------------------------------

This is the fourth part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) that describes [system calls](https://en.wikipedia.org/wiki/System_call) in the Linux kernel, and as I wrote in the conclusion of the [previous part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html), this part will be the last in this chapter. In the previous part we stopped at two new concepts:

* `vsyscall`;
* `vDSO`;

that are related and very similar to the system call concept.

This part will be the last part in this chapter and, as you can understand from the part's title, we will see what occurs in the Linux kernel when we run our programs. So, let's start.

How do we launch our programs?
--------------------------------------------------------------------------------

There are many different ways to launch an application from a user's perspective. For example, we can run a program from the [shell](https://en.wikipedia.org/wiki/Unix_shell) or double-click on the application icon. It does not matter. The Linux kernel handles application launch regardless of how we launch the application.

In this part we will consider the case when we launch an application from the shell. As you know, the standard way to launch an application from the shell is the following: we launch a [terminal emulator](https://en.wikipedia.org/wiki/Terminal_emulator) application, write the name of the program and optionally pass arguments to it, for example:

![ls shell](http://s14.postimg.org/d6jgidc7l/Screenshot_from_2015_09_07_17_31_55.png)

Let's consider what occurs when we launch an application from the shell: what the shell does when we write the program name, what the Linux kernel does, etc. But before we start to consider these interesting things, I want to warn you that this book is about the Linux kernel. That's why in this part we will mostly see Linux kernel internals. We will not consider in detail what the shell does, nor complex cases, for example subshells, etc.

My default shell is [bash](https://en.wikipedia.org/wiki/Bash_%28Unix_shell%29), so I will consider how the bash shell launches a program. So let's start. The `bash` shell, as well as any program written in the [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) programming language, starts from the [main](https://en.wikipedia.org/wiki/Entry_point) function. If you look at the source code of the `bash` shell, you will find the `main` function in the [shell.c](https://github.com/bminor/bash/blob/master/shell.c#L357) source code file. This function does many different things before the main loop of `bash` starts to work. For example, this function:

* checks for and tries to open `/dev/tty`;
* checks whether the shell is running in debug mode;
* parses command line arguments;
* reads the shell environment;
* loads `.bashrc`, `.profile` and other configuration files;
* and many many more.

After all of these operations we can see the call of the `reader_loop` function. This function is defined in the [eval.c](https://github.com/bminor/bash/blob/master/eval.c#L67) source code file and represents the main loop, or in other words, it reads and executes commands. After the `reader_loop` function has made all checks and read the given program name and arguments, it calls the `execute_command` function from the [execute_cmd.c](https://github.com/bminor/bash/blob/master/execute_cmd.c#L378) source code file. The `execute_command` function, through the chain of function calls:

```
execute_command
--> execute_command_internal
----> execute_simple_command
------> execute_disk_command
--------> shell_execve
```

makes different checks, like whether we need to start a `subshell` or whether it is a builtin `bash` function, etc. As I already wrote above, we will not consider the details of things that are not related to the Linux kernel. At the end of this process, the `shell_execve` function calls the `execve` system call:

```C
execve (command, args, env);
```

The `execve` system call has the following signature:

```
int execve(const char *filename, char *const argv [], char *const envp[]);
```

and executes a program given by filename, with the given arguments and [environment variables](https://en.wikipedia.org/wiki/Environment_variable). In our case, this system call is the first and the only one, for example:

```
$ strace ls
execve("/bin/ls", ["ls"], [/* 62 vars */]) = 0

$ strace echo
execve("/bin/echo", ["echo"], [/* 62 vars */]) = 0

$ strace uname
execve("/bin/uname", ["uname"], [/* 62 vars */]) = 0
```

So, a user application (`bash` in our case) calls the system call, and as we already know, the next step is the Linux kernel.
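
We can do the same thing without a shell. Here is a minimal, hypothetical sketch that calls `execve` directly from C; on success it never returns, because the calling program's image is replaced:

```C
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char *argv[] = { "ls", "-l", NULL };
	char *envp[] = { NULL };

	execve("/bin/ls", argv, envp);

	/* reached only if execve failed */
	perror("execve");
	return 1;
}
```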

execve system call
--------------------------------------------------------------------------------

We saw the preparation before a system call is invoked by a user application and what happens after a system call handler finishes its work in the second [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) of this chapter. We stopped at the call of the `execve` system call in the previous paragraph. This system call is defined in the [fs/exec.c](https://github.com/torvalds/linux/blob/master/fs/exec.c) source code file and, as we already know, it takes three arguments:

```
SYSCALL_DEFINE3(execve,
		const char __user *, filename,
		const char __user *const __user *, argv,
		const char __user *const __user *, envp)
{
	return do_execve(getname(filename), argv, envp);
}
```

The implementation of `execve` is pretty simple here: as we can see, it just returns the result of the `do_execve` function. The `do_execve` function is defined in the same source code file and does the following things:

* initializes two pointers to the userspace data with the given arguments and environment variables;
* returns the result of `do_execveat_common`.

We can see this in its implementation:

```C
struct user_arg_ptr argv = { .ptr.native = __argv };
struct user_arg_ptr envp = { .ptr.native = __envp };
return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
```

The `do_execveat_common` function does the main work - it executes a new program. This function takes a similar set of arguments, but as you can see, it takes five arguments instead of three. The first argument is the file descriptor that represents the directory with our application; in our case `AT_FDCWD` means that the given pathname is interpreted relative to the current working directory of the calling process. The fifth argument is flags; in our case we passed `0` to `do_execveat_common`. We will check it in a later step.
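
The `AT_FDCWD`/flags pair is the same convention that the newer `execveat` system call exposes to userspace. A hypothetical illustration of invoking it through the raw `syscall(2)` wrapper, assuming a kernel of 3.19 or later whose headers define `SYS_execveat`:

```C
#include <fcntl.h>        /* AT_FDCWD */
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	char *argv[] = { "ls", NULL };
	char *envp[] = { NULL };

	/* equivalent to execve("/bin/ls", argv, envp): the pathname is
	   absolute, so the AT_FDCWD directory descriptor is not used */
	syscall(SYS_execveat, AT_FDCWD, "/bin/ls", argv, envp, 0);
	return 1; /* reached only on failure */
}
```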

First of all, the `do_execveat_common` function checks the `filename` pointer and returns an error if it is invalid. After this we check the flags of the current process to make sure the limit of running processes is not exceeded:

```C
if (IS_ERR(filename))
	return PTR_ERR(filename);

if ((current->flags & PF_NPROC_EXCEEDED) &&
    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
	retval = -EAGAIN;
	goto out_ret;
}

current->flags &= ~PF_NPROC_EXCEEDED;
```

If these two checks were successful, we unset the `PF_NPROC_EXCEEDED` flag in the flags of the current process to prevent failure of the `execve`. In the next step we call the `unshare_files` function that is defined in [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c), unshares the files of the current task and checks the result:

```C
retval = unshare_files(&displaced);
if (retval)
	goto out_ret;
```

We need to call this function to eliminate a potential leak of the execve'd binary's [file descriptor](https://en.wikipedia.org/wiki/File_descriptor). In the next step we start preparation of the `bprm`, which is represented by the `struct linux_binprm` structure (defined in the [include/linux/binfmts.h](https://github.com/torvalds/linux/blob/master/include/linux/binfmts.h) header file). The `linux_binprm` structure is used to hold the arguments that are used when loading binaries. For example, it contains the `vma` field, which has the `vm_area_struct` type and represents a single memory area over a contiguous interval in a given address space where our application will be loaded, the `mm` field, which is the memory descriptor of the binary, a pointer to the top of memory and many other fields.

First of all we allocate memory for this structure with the `kzalloc` function and check the result of the allocation:

```C
bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
if (!bprm)
	goto out_files;
```

After this we start to prepare the `binprm` credentials with the call of the `prepare_bprm_creds` function:

```C
retval = prepare_bprm_creds(bprm);
if (retval)
	goto out_free;

check_unsafe_exec(bprm);
current->in_execve = 1;
```

Initialization of the `binprm` credentials is, in other words, initialization of the `cred` structure that is stored inside the `linux_binprm` structure. The `cred` structure contains the security context of a task, for example the [real uid](https://en.wikipedia.org/wiki/User_identifier#Real_user_ID) of the task, the real [gid](https://en.wikipedia.org/wiki/Group_identifier) of the task, the `uid` and `gid` for the [virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system) operations, etc. In the next step, having prepared the `bprm` credentials, we check that we can safely execute the program with a call of the `check_unsafe_exec` function and set the current process to the `in_execve` state.

After all of these operations we call the `do_open_execat` function, which checks the flags that we passed to the `do_execveat_common` function (remember that we have `0` in `flags`), searches for and opens the executable file on disk, checks that we are not loading a binary from a `noexec` mount point (we need to avoid executing a binary from filesystems that do not contain executable binaries, like [proc](https://en.wikipedia.org/wiki/Procfs) or [sysfs](https://en.wikipedia.org/wiki/Sysfs)), initializes the `file` structure and returns a pointer to this structure. Next we can see a call of `sched_exec` after this:

```C
file = do_open_execat(fd, filename, flags);
retval = PTR_ERR(file);
if (IS_ERR(file))
	goto out_unmark;

sched_exec();
```

The `sched_exec` function is used to determine the least loaded processor that can execute the new program and to migrate the current process to it.

After this we need to check the [file descriptor](https://en.wikipedia.org/wiki/File_descriptor) of the given executable binary. We check whether the name of our binary file starts with the `/` symbol or whether the path of the given executable binary should be interpreted relative to the current working directory of the calling process, or, in other words, whether the file descriptor is `AT_FDCWD` (read about this above).

If one of these checks is successful, we set the binary parameter filename:

```C
bprm->file = file;

if (fd == AT_FDCWD || filename->name[0] == '/') {
	bprm->filename = filename->name;
}
```

Otherwise, if the filename is empty, we set the binary parameter filename to `/dev/fd/%d` or `/dev/fd/%d/%s`, depending on the filename of the given executable binary, which means that we will execute the file to which the file descriptor refers:

```C
} else {
	if (filename->name[0] == '\0')
		pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d", fd);
	else
		pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d/%s",
				    fd, filename->name);
	if (!pathbuf) {
		retval = -ENOMEM;
		goto out_unmark;
	}

	bprm->filename = pathbuf;
}

bprm->interp = bprm->filename;
```

Note that we set not only `bprm->filename` but also `bprm->interp`, which will contain the name of the program interpreter. For now we just write the same name there, but later it will be updated with the real name of the program interpreter, depending on the binary format of the program. You can read above that we have already prepared the `cred` for the `linux_binprm`. The next step is the initialization of other fields of the `linux_binprm`. First of all we call the `bprm_mm_init` function and pass the `bprm` to it:

```C
retval = bprm_mm_init(bprm);
if (retval)
	goto out_unmark;
```

`bprm_mm_init` is defined in the same source code file and, as we can understand from the function's name, it initializes the memory descriptor, or, in other words, the `bprm_mm_init` function initializes an `mm_struct` structure. This structure is defined in the [include/linux/mm_types.h](https://github.com/torvalds/linux/blob/master/include/linux/mm_types.h) header file and represents the address space of a process. We will not consider the implementation of the `bprm_mm_init` function because it depends on many important concepts related to the Linux kernel memory manager; we just need to know that this function initializes the `mm_struct` and populates it with a temporary stack `vm_area_struct`.

After this we calculate the count of command line arguments which were passed to our executable binary and the count of environment variables, and store them in `bprm->argc` and `bprm->envc` respectively:

```C
bprm->argc = count(argv, MAX_ARG_STRINGS);
if ((retval = bprm->argc) < 0)
	goto out;

bprm->envc = count(envp, MAX_ARG_STRINGS);
if ((retval = bprm->envc) < 0)
	goto out;
```

As you can see, we do these operations with the help of the `count` function, which is defined in the [same](https://github.com/torvalds/linux/blob/master/fs/exec.c) source code file and calculates the count of strings in the `argv` array. The `MAX_ARG_STRINGS` macro is defined in the [include/uapi/linux/binfmts.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/binfmts.h) header file and, as we can understand from the macro's name, it represents the maximum number of strings that can be passed to the `execve` system call. The value of `MAX_ARG_STRINGS`:

```C
#define MAX_ARG_STRINGS 0x7FFFFFFF
```
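
The essence of `count` is just a bounded walk over a `NULL`-terminated pointer array. A simplified userspace sketch of that logic (the real kernel function additionally fetches each pointer from user memory and checks for pending signals):

```C
#include <errno.h>

/* simplified sketch of the kernel's count(): walk a NULL-terminated
   array of strings, refusing to count more than max entries */
static int count_strings(char *const argv[], int max)
{
	int i = 0;

	if (!argv)
		return 0;

	while (argv[i]) {
		if (++i > max)
			return -E2BIG;
	}
	return i;
}
```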

After we have calculated the number of command line arguments and environment variables, we call the `prepare_binprm` function. We already called a function with a similar name before this moment: that function was `prepare_bprm_creds`, and we remember that it initializes the `cred` structure in the `linux_binprm`. Now the `prepare_binprm` function:

```C
retval = prepare_binprm(bprm);
if (retval < 0)
	goto out;
```

fills the `linux_binprm` structure with the `uid` from the [inode](https://en.wikipedia.org/wiki/Inode) and reads `128` bytes from the binary executable file. We read only the first `128` bytes of the executable file because we need to check the type of our executable; we will read the rest of the executable file in a later step. After the preparation of the `linux_binprm` structure, we copy the filename of the executable binary file, the command line arguments and the environment variables to the `linux_binprm` with calls of the `copy_strings_kernel` and `copy_strings` functions:

```C
retval = copy_strings_kernel(1, &bprm->filename, bprm);
if (retval < 0)
	goto out;

retval = copy_strings(bprm->envc, envp, bprm);
if (retval < 0)
	goto out;

retval = copy_strings(bprm->argc, argv, bprm);
if (retval < 0)
	goto out;
```

And we set the pointer to the top of the new program's stack that we set up in the `bprm_mm_init` function:

```C
bprm->exec = bprm->p;
```

The top of the stack will contain the program filename, and we store this filename in the `exec` field of the `linux_binprm` structure.

Now that we have a filled `linux_binprm` structure, we call the `exec_binprm` function:

```C
retval = exec_binprm(bprm);
if (retval < 0)
	goto out;
```

First of all, in `exec_binprm` we store the [pid](https://en.wikipedia.org/wiki/Process_identifier) and the `pid` as seen from the [namespace](https://en.wikipedia.org/wiki/Linux_namespaces) of the current task:

```C
old_pid = current->pid;
rcu_read_lock();
old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent));
rcu_read_unlock();
```

and call the:

```C
search_binary_handler(bprm);
```

function. This function goes through the list of handlers for different binary formats. Currently the Linux kernel supports the following binary formats:

* `binfmt_script` - support for interpreted scripts that start with the [#!](https://en.wikipedia.org/wiki/Shebang_%28Unix%29) line;
* `binfmt_misc` - support for different binary formats, according to the runtime configuration of the Linux kernel;
* `binfmt_elf` - support for the [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) format;
* `binfmt_aout` - support for the [a.out](https://en.wikipedia.org/wiki/A.out) format;
* `binfmt_flat` - support for the [flat](https://en.wikipedia.org/wiki/Binary_file#Structure) format;
* `binfmt_elf_fdpic` - support for [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) [FDPIC](http://elinux.org/UClinux_Shared_Library#FDPIC_ELF) binaries;
* `binfmt_em86` - support for Intel [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) binaries running on [Alpha](https://en.wikipedia.org/wiki/DEC_Alpha) machines.

So, `search_binary_handler` tries to call the `load_binary` function of each handler and passes the `linux_binprm` to it. If a binary handler supports the given executable file format, it starts to prepare the executable binary for execution:

```C
int search_binary_handler(struct linux_binprm *bprm)
{
	...
	...
	...
	list_for_each_entry(fmt, &formats, lh) {
		retval = fmt->load_binary(bprm);
		if (retval < 0 && !bprm->mm) {
			force_sigsegv(SIGSEGV, current);
			return retval;
		}
	}

	return retval;
```

The `load_binary` callback, for example for [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format), checks the magic number (each `elf` binary file contains a magic number in its header) in the `linux_binprm` buffer (remember that we read the first `128` bytes from the executable binary file) and exits if it is not an `elf` binary:

```C
static int load_elf_binary(struct linux_binprm *bprm)
{
	...
	...
	...
	loc->elf_ex = *((struct elfhdr *)bprm->buf);

	if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
		goto out;
```
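
The same check is easy to reproduce from userspace. A small sketch using the `ELFMAG`/`SELFMAG` constants from `<elf.h>`, mirroring what `load_elf_binary` does with the first bytes of `bprm->buf`:

```C
#include <elf.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	unsigned char ident[EI_NIDENT];
	FILE *f;

	if (argc < 2)
		return 1;

	f = fopen(argv[1], "rb");
	if (!f || fread(ident, 1, EI_NIDENT, f) != EI_NIDENT)
		return 1;

	/* same comparison the kernel performs on bprm->buf */
	if (memcmp(ident, ELFMAG, SELFMAG) == 0)
		printf("%s is an ELF binary\n", argv[1]);
	else
		printf("%s is not an ELF binary\n", argv[1]);

	fclose(f);
	return 0;
}
```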

If the given executable file is in the `elf` format, `load_elf_binary` continues executing. `load_elf_binary` does many different things to prepare the executable file for execution. For example, it checks the architecture and the type of the executable file:

```C
if (loc->elf_ex.e_type != ET_EXEC && loc->elf_ex.e_type != ET_DYN)
	goto out;
if (!elf_check_arch(&loc->elf_ex))
	goto out;
```

and exits if the architecture is wrong or the executable file is neither executable nor shared. Then it tries to load the `program header table`:

```C
elf_phdata = load_elf_phdrs(&loc->elf_ex, bprm->file);
if (!elf_phdata)
	goto out;
```

which describes [segments](https://en.wikipedia.org/wiki/Memory_segmentation). It reads the `program interpreter` and the libraries linked with our executable binary file from disk and loads them into memory. The `program interpreter` is specified in the `.interp` section of the executable file and, as you can read in the part that describes [Linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html), it is `/lib64/ld-linux-x86-64.so.2` for `x86_64`. It sets up the stack and maps the `elf` binary into the correct location in memory. It maps the [bss](https://en.wikipedia.org/wiki/.bss) and the [brk](http://man7.org/linux/man-pages/man2/sbrk.2.html) sections and does many many other things to prepare the executable file for execution.
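
We can inspect the program header table and the interpreter of a binary from userspace with `readelf`; the exact output varies per system, so treat this as an illustrative sample:

```
~$ readelf -l /bin/ls | grep interpreter
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
```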

At the end of the execution of `load_elf_binary`, we call the `start_thread` function and pass three arguments to it:

```C
	start_thread(regs, elf_entry, bprm->p);
	retval = 0;
out:
	kfree(loc);
out_ret:
	return retval;
```

These arguments are:

* the set of [registers](https://en.wikipedia.org/wiki/Processor_register) for the new task;
* the address of the entry point of the new task;
* the address of the top of the stack for the new task.

As we might guess from the function's name, it starts a new thread - but that is not the case. The `start_thread` function just prepares the new task's registers to be ready to run. Let's look at the implementation of this function:

```C
void
start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
{
	start_thread_common(regs, new_ip, new_sp,
			    __USER_CS, __USER_DS, 0);
}
```

As we can see, the `start_thread` function just makes a call of the `start_thread_common` function that does all the work for us:

```C
static void
start_thread_common(struct pt_regs *regs, unsigned long new_ip,
		    unsigned long new_sp,
		    unsigned int _cs, unsigned int _ss, unsigned int _ds)
{
	loadsegment(fs, 0);
	loadsegment(es, _ds);
	loadsegment(ds, _ds);
	load_gs_index(0);
	regs->ip = new_ip;
	regs->sp = new_sp;
	regs->cs = _cs;
	regs->ss = _ss;
	regs->flags = X86_EFLAGS_IF;
	force_iret();
}
```

The `start_thread_common` function fills the `fs` segment register with zero and `es` and `ds` with the value of the data segment register. After this we set the new values of the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter), the `cs` segment, etc. At the end of the `start_thread_common` function we can see the `force_iret` macro that forces a system call return via the `iret` instruction. Ok, we have prepared the new thread to run in userspace and now we can return from `exec_binprm`, landing in `do_execveat_common` again. After `exec_binprm` finishes its execution, we release the memory for the structures that were allocated before and return.

After we return from the `execve` system call handler, execution of our program starts. We can do this because all the context-related information is already configured for this purpose. As we saw, the `execve` system call does not return control to the process; instead, the code, data and other segments of the caller process are simply overwritten by the segments of the new program. The exit from our application will happen through the `exit` system call.

That's all. From this point on, our program will be executed.

Conclusion
--------------------------------------------------------------------------------

This is the end of the fourth and last part about the system calls concept in the Linux kernel. We saw almost everything related to the `system call` concept in these four parts. We started from understanding the `system call` concept: we learned what it is and why user applications need it. Next we saw how Linux handles a system call from a user application. We met two concepts similar to the `system call` concept - `vsyscall` and `vDSO` - and finally we saw how the Linux kernel runs a user program.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [System call](https://en.wikipedia.org/wiki/System_call)
* [shell](https://en.wikipedia.org/wiki/Unix_shell)
* [bash](https://en.wikipedia.org/wiki/Bash_%28Unix_shell%29)
* [entry point](https://en.wikipedia.org/wiki/Entry_point)
* [C](https://en.wikipedia.org/wiki/C_%28programming_language%29)
* [environment variables](https://en.wikipedia.org/wiki/Environment_variable)
* [file descriptor](https://en.wikipedia.org/wiki/File_descriptor)
* [real uid](https://en.wikipedia.org/wiki/User_identifier#Real_user_ID)
* [virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system)
* [procfs](https://en.wikipedia.org/wiki/Procfs)
* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
* [inode](https://en.wikipedia.org/wiki/Inode)
* [pid](https://en.wikipedia.org/wiki/Process_identifier)
* [namespace](https://en.wikipedia.org/wiki/Linux_namespaces)
* [#!](https://en.wikipedia.org/wiki/Shebang_%28Unix%29)
* [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format)
* [a.out](https://en.wikipedia.org/wiki/A.out)
* [flat](https://en.wikipedia.org/wiki/Binary_file#Structure)
* [Alpha](https://en.wikipedia.org/wiki/DEC_Alpha)
* [FDPIC](http://elinux.org/UClinux_Shared_Library#FDPIC_ELF)
* [segments](https://en.wikipedia.org/wiki/Memory_segmentation)
* [Linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)
* [Processor register](https://en.wikipedia.org/wiki/Processor_register)
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html)

@ -0,0 +1,11 @@
# Timers and time management

This chapter describes timers and time management related concepts in the linux kernel.

* [Introduction](http://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) - this part is an introduction to timers in the Linux kernel.
* [Introduction to the clocksource framework](https://github.com/0xAX/linux-insides/blob/master/Timers/timers-2.md) - this part describes the `clocksource` framework in the Linux kernel.
* [The tick broadcast framework and dyntick](https://github.com/0xAX/linux-insides/blob/master/Timers/timers-3.md) - this part describes the tick broadcast framework and the dyntick concept.
* [Introduction to timers](https://github.com/0xAX/linux-insides/blob/master/Timers/timers-4.md) - this part describes timers in the Linux kernel.
* [Introduction to the clockevents framework](https://github.com/0xAX/linux-insides/blob/master/Timers/timers-5.md) - this part describes yet another clock/time management related framework - `clockevents`.
* [x86 related clock sources](https://github.com/0xAX/linux-insides/blob/master/Timers/timers-6.md) - this part describes `x86_64` related clock sources.
* [Time related system calls in the Linux kernel](https://github.com/0xAX/linux-insides/blob/master/Timers/timers-7.md) - this part describes time related system calls.
@ -0,0 +1,436 @@
|
||||
Timers and time management in the Linux kernel. Part 1.
|
||||
================================================================================
|
||||
|
||||
Introduction
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is yet another post that opens new chapter in the [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book. The previous [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-4.html) was a list part of the chapter that describes [system call](https://en.wikipedia.org/wiki/System_call) concept and now time is to start new chapter. As you can understand from the post's title, this chapter will be devoted to the `timers` and `time management` in the Linux kernel. The choice of topic for the current chapter is not accidental. Timers and generally time management are very important and widely used in the Linux kernel. The Linux kernel uses timers for various tasks, different timeouts for example in [TCP](https://en.wikipedia.org/wiki/Transmission_Control_Protocol) implementation, the kernel must know current time, scheduling asynchronous functions, next event interrupt scheduling and many many more.
|
||||
|
||||
So, we will start to learn implementation of the different time management related stuff in this part. We will see different types of timers and how do different Linux kernel subsystems use them. As always we will start from the earliest part of the Linux kernel and will go through initialization process of the Linux kernel. We already did it in the special [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) which describes initialization process of the Linux kernel, but as you may remember we missed some things there. And one of them is the initialization of timers.
|
||||
|
||||
Let's start.
|
||||
|
||||
Initialization of non-standard PC hardware clock
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
After the Linux kernel was decompressed (more about this you can read in the [Kernel decompression](https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) part) the architecture non-specific code starts to work in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file. After initialization of the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt), initialization of [cgroups](https://en.wikipedia.org/wiki/Cgroups) and setting [canary](https://en.wikipedia.org/wiki/Buffer_overflow_protection) value we can see the call of the `setup_arch` function.
|
||||
|
||||
As you may remember, this function is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L842) source code file and prepares/initializes architecture-specific things (for example it reserves a place for the [bss](https://en.wikipedia.org/wiki/.bss) section, reserves a place for the [initrd](https://en.wikipedia.org/wiki/Initrd), parses the kernel command line, and many many other things). Besides this, we can find some time management related functions there.
|
||||
|
||||
The first is:
|
||||
|
||||
```C
|
||||
x86_init.timers.wallclock_init();
|
||||
```
|
||||
|
||||
We already saw the `x86_init` structure in the chapter that describes the initialization of the Linux kernel. This structure contains pointers to the default setup functions for different platforms like [Intel MID](https://en.wikipedia.org/wiki/Mobile_Internet_device#Intel_MID_platforms), [Intel CE4100](http://www.wpgholdings.com/epaper/US/newsRelease_20091215/255874.pdf), etc. The `x86_init` structure is defined in [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/x86_init.c#L36) and, as you can see, it describes standard PC hardware by default.
|
||||
|
||||
As we can see, the `x86_init` structure has the `x86_init_ops` type, which provides a set of functions for platform-specific setup like reserving standard resources, platform-specific memory setup, initialization of interrupt handlers, etc. This structure looks like:
|
||||
|
||||
```C
|
||||
struct x86_init_ops {
|
||||
struct x86_init_resources resources;
|
||||
struct x86_init_mpparse mpparse;
|
||||
struct x86_init_irqs irqs;
|
||||
struct x86_init_oem oem;
|
||||
struct x86_init_paging paging;
|
||||
struct x86_init_timers timers;
|
||||
struct x86_init_iommu iommu;
|
||||
struct x86_init_pci pci;
|
||||
};
|
||||
```
|
||||
|
||||
We can note the `timers` field, which has the `x86_init_timers` type; as we can understand from its name, this field is related to time management and timers. `x86_init_timers` contains four fields, all of which are pointers to functions that return [void](https://en.wikipedia.org/wiki/Void_type) (the structure definition is shown right after this list):
|
||||
|
||||
* `setup_percpu_clockev` - set up the per cpu clock event device for the boot cpu;
|
||||
* `tsc_pre_init` - platform function called before [TSC](https://en.wikipedia.org/wiki/Time_Stamp_Counter) init;
|
||||
* `timer_init` - initialize the platform timer;
|
||||
* `wallclock_init` - initialize the wallclock device.
|
||||
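
For reference, the `x86_init_timers` structure itself is defined in the [arch/x86/include/asm/x86_init.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/x86_init.h) header file and, at the time of this writing, looks roughly like this:

```C
struct x86_init_timers {
	void (*setup_percpu_clockev)(void);
	void (*tsc_pre_init)(void);
	void (*timer_init)(void);
	void (*wallclock_init)(void);
};
```
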
|
||||
So, as we already know, in our case `wallclock_init` executes initialization of the wallclock device. If we look at the `x86_init` structure, we will see that `wallclock_init` points to `x86_init_noop`:
|
||||
|
||||
```C
|
||||
struct x86_init_ops x86_init __initdata = {
|
||||
...
|
||||
...
|
||||
...
|
||||
.timers = {
|
||||
.wallclock_init = x86_init_noop,
|
||||
},
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
Where the `x86_init_noop` is just a function that does nothing:
|
||||
|
||||
```C
|
||||
void __cpuinit x86_init_noop(void) { }
|
||||
```
|
||||
|
||||
for the standard PC hardware. Actually, the `wallclock_init` function is really used only by the [Intel MID](https://en.wikipedia.org/wiki/Mobile_Internet_device#Intel_MID_platforms) platform. The `x86_init.timers.wallclock_init` field is overridden in the [arch/x86/platform/intel-mid/intel-mid.c](https://github.com/torvalds/linux/blob/master/arch/x86/platform/intel-mid/intel-mid.c) source code file, in the `x86_intel_mid_early_setup` function:
|
||||
|
||||
```C
|
||||
void __init x86_intel_mid_early_setup(void)
|
||||
{
|
||||
...
|
||||
...
|
||||
...
|
||||
x86_init.timers.wallclock_init = intel_mid_rtc_init;
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
The implementation of the `intel_mid_rtc_init` function is in the [arch/x86/platform/intel-mid/intel_mid_vrtc.c](https://github.com/torvalds/linux/blob/master/arch/x86/platform/intel-mid/intel_mid_vrtc.c) source code file and looks pretty simple. First of all, this function parses the [Simple Firmware Interface](https://en.wikipedia.org/wiki/Simple_Firmware_Interface) M-Real-Time-Clock table to collect such devices into the `sfi_mrtc_array` array, and then sets up the wallclock `set_time` and `get_time` functions:
|
||||
|
||||
```C
|
||||
void __init intel_mid_rtc_init(void)
|
||||
{
|
||||
unsigned long vrtc_paddr;
|
||||
|
||||
sfi_table_parse(SFI_SIG_MRTC, NULL, NULL, sfi_parse_mrtc);
|
||||
|
||||
vrtc_paddr = sfi_mrtc_array[0].phys_addr;
|
||||
if (!sfi_mrtc_num || !vrtc_paddr)
|
||||
return;
|
||||
|
||||
vrtc_virt_base = (void __iomem *)set_fixmap_offset_nocache(FIX_LNW_VRTC,
|
||||
vrtc_paddr);
|
||||
|
||||
x86_platform.get_wallclock = vrtc_get_time;
|
||||
x86_platform.set_wallclock = vrtc_set_mmss;
|
||||
}
|
||||
```
|
||||
|
||||
That's all; after this, a device based on `Intel MID` will be able to get the time from the hardware clock. As I already wrote, on standard PC [x86_64](https://en.wikipedia.org/wiki/X86-64) hardware this hook remains the `x86_init_noop` stub and simply does nothing when called. We just saw the initialization of the [real time clock](https://en.wikipedia.org/wiki/Real-time_clock) for the [Intel MID](https://en.wikipedia.org/wiki/Mobile_Internet_device#Intel_MID_platforms) platform; now it is time to return to the general `x86_64` architecture and look at the time management related stuff there.
|
||||
|
||||
Acquainted with jiffies
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
If we return to the `setup_arch` function, which as you remember is located in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L842) source code file, we will see the next call of a time management related function:
|
||||
|
||||
```C
|
||||
register_refined_jiffies(CLOCK_TICK_RATE);
|
||||
```
|
||||
|
||||
Before we look at the implementation of this function, we must know about the [jiffy](https://en.wikipedia.org/wiki/Jiffy_%28time%29). As we can read on Wikipedia:
|
||||
|
||||
```
|
||||
Jiffy is an informal term for any unspecified short period of time
|
||||
```
|
||||
|
||||
This definition is very similar to the `jiffy` in the Linux kernel. There is a global variable named `jiffies` which holds the number of ticks that have occurred since the system booted. The Linux kernel sets this variable to zero:
|
||||
|
||||
```C
|
||||
extern unsigned long volatile __jiffy_data jiffies;
|
||||
```
|
||||
|
||||
during the initialization process. This global variable is incremented on each timer interrupt. Besides this, near the `jiffies` variable we can see the definition of a similar variable:
|
||||
|
||||
```C
|
||||
extern u64 jiffies_64;
|
||||
```
|
||||
|
||||
Actually, only one of these variables is in use in the Linux kernel at a time, and it depends on the processor type: for [x86_64](https://en.wikipedia.org/wiki/X86-64) it will be the `u64` variant, and for [x86](https://en.wikipedia.org/wiki/X86) it is `unsigned long`. We will see this if we look at the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S) linker script:
|
||||
|
||||
```
|
||||
#ifdef CONFIG_X86_32
|
||||
...
|
||||
jiffies = jiffies_64;
|
||||
...
|
||||
#else
|
||||
...
|
||||
jiffies_64 = jiffies;
|
||||
...
|
||||
#endif
|
||||
```
|
||||
|
||||
In the case of `x86_32`, `jiffies` will be the lower `32` bits of the `jiffies_64` variable. Schematically, we can imagine it as follows:
|
||||
|
||||
```
|
||||
jiffies_64
|
||||
+-----------------------------------------------------+
|
||||
| | |
|
||||
| | |
|
||||
| | jiffies on `x86_32` |
|
||||
| | |
|
||||
| | |
|
||||
+-----------------------------------------------------+
|
||||
63 31 0
|
||||
```
|
||||
|
||||
Now that we know a little theory about `jiffies`, we can return to our function. There is no architecture-specific implementation of our function - `register_refined_jiffies` is located in the generic kernel code, in the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file. The main point of `register_refined_jiffies` is the registration of the jiffy `clocksource`. Before we look at the implementation of the `register_refined_jiffies` function, we must know what a `clocksource` is. As we can read in the comments:
|
||||
|
||||
```
|
||||
The `clocksource` is hardware abstraction for a free-running counter.
|
||||
```
|
||||
|
||||
I'm not sure about you, but that description didn't give me a good understanding of the `clocksource` concept. Let's try to understand what it is, but we will not go too deep because this topic will be described in a separate part in much more detail. The main point of the `clocksource` is a timekeeping abstraction or, in very simple words, something that provides a time value to the kernel. We already know about the `jiffies` interface, which represents the number of ticks that have occurred since the system booted. It is represented by a global variable in the Linux kernel and is incremented on each timer interrupt. The Linux kernel can use `jiffies` for time measurement. So why do we need a separate abstraction like the `clocksource`? Actually, different hardware devices provide different clock sources that vary widely in their capabilities. The availability of more precise techniques for time interval measurement is hardware-dependent.
|
||||
|
||||
For example, `x86` has an on-chip 64-bit counter called the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter), whose frequency can be equal to the processor frequency. Or, for example, the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer), which consists of a `64-bit` counter running at a frequency of at least `10 MHz`. These are two different timers and they are both for `x86`. If we add timers from other architectures, this only makes the problem more complex. The Linux kernel provides the `clocksource` concept to solve this problem.
|
||||
|
||||
The clocksource concept is represented by the `clocksource` structure in the Linux kernel. This structure is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h) header file and contains a couple of fields that describe a time counter. For example, it contains the `name` field, which is the name of a counter, the `flags` field, which describes different properties of a counter, pointers to the `suspend` and `resume` functions, and many more.
|
||||
|
||||
Let's look at the `clocksource` structure for jiffies, which is defined in the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file:
|
||||
|
||||
```C
|
||||
static struct clocksource clocksource_jiffies = {
|
||||
.name = "jiffies",
|
||||
.rating = 1,
|
||||
.read = jiffies_read,
|
||||
.mask = 0xffffffff,
|
||||
.mult = NSEC_PER_JIFFY << JIFFIES_SHIFT,
|
||||
.shift = JIFFIES_SHIFT,
|
||||
.max_cycles = 10,
|
||||
};
|
||||
```
|
||||
|
||||
We can see the definition of the default name here - `jiffies`. The next is the `rating` field, which allows the best registered clock source available for the given hardware to be chosen by the clock source management code. The `rating` may have the following values:
|
||||
|
||||
* `1-99` - Only available for bootup and testing purposes;
* `100-199` - Functional for real use, but not desired;
* `200-299` - A correct and usable clocksource;
* `300-399` - A reasonably fast and accurate clocksource;
* `400-499` - The ideal clocksource, a must-use where available.
|
||||
|
||||
For example, the rating of the [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) is `300`, while the rating of the [high precision event timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) is `250`. The next field is `read` - a pointer to a function that reads the clocksource's cycle value; in our case it just returns the `jiffies` variable with the `cycle_t` type:
|
||||
|
||||
```C
|
||||
static cycle_t jiffies_read(struct clocksource *cs)
|
||||
{
|
||||
return (cycle_t) jiffies;
|
||||
}
|
||||
```
|
||||
|
||||
which is just a 64-bit unsigned type:
|
||||
|
||||
```C
|
||||
typedef u64 cycle_t;
|
||||
```
|
||||
|
||||
The next field is the `mask` value, which ensures that subtraction between counter values from non-`64-bit` counters does not need special overflow logic. In our case the mask is `0xffffffff`, i.e. `32` bits. This means that a `jiffy` wraps around to zero after `42` seconds:
|
||||
|
||||
```python
|
||||
>>> 0xffffffff
|
||||
4294967295
|
||||
# 42 nanoseconds
|
||||
>>> 42 * pow(10, -9)
|
||||
4.2000000000000006e-08
|
||||
# 43 nanoseconds
|
||||
>>> 43 * pow(10, -9)
|
||||
4.3e-08
|
||||
```
|
||||
|
||||
The next two fields, `mult` and `shift`, are used to convert the clocksource's period to nanoseconds per cycle. When the kernel calls the `clocksource.read` function, this function returns a value in `machine` time units represented by the `cycle_t` data type that we just saw. To convert this return value to [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond) we need these two fields: `mult` and `shift`. The `clocksource` framework provides the `clocksource_cyc2ns` function, which does it for us with the following expression:
|
||||
|
||||
```C
|
||||
((u64) cycles * mult) >> shift;
|
||||
```
|
||||
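
In the kernel source this expression lives in the small `clocksource_cyc2ns` helper in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h) header file, which looks roughly like this:

```C
static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift)
{
	return ((u64) cycles * mult) >> shift;
}
```
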
|
||||
As we can see, the `mult` field is equal to:
|
||||
|
||||
```C
|
||||
NSEC_PER_JIFFY << JIFFIES_SHIFT
|
||||
|
||||
#define NSEC_PER_JIFFY ((NSEC_PER_SEC+HZ/2)/HZ)
|
||||
#define NSEC_PER_SEC 1000000000L
|
||||
```
|
||||
|
||||
by default, and the `shift` is:
|
||||
|
||||
```C
|
||||
#if HZ < 34
|
||||
#define JIFFIES_SHIFT 6
|
||||
#elif HZ < 67
|
||||
#define JIFFIES_SHIFT 7
|
||||
#else
|
||||
#define JIFFIES_SHIFT 8
|
||||
#endif
|
||||
```
|
||||
|
||||
The `jiffies` clock source uses the `NSEC_PER_JIFFY` multiplier conversion to specify the nanoseconds-per-cycle ratio. Note that the values of `JIFFIES_SHIFT` and `NSEC_PER_JIFFY` depend on the `HZ` value. `HZ` represents the frequency of the system timer. This macro is defined in [include/asm-generic/param.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/param.h) and depends on the `CONFIG_HZ` kernel configuration option. The value of `HZ` differs for each supported architecture, but for `x86` it's defined like:
|
||||
|
||||
```C
|
||||
#define HZ CONFIG_HZ
|
||||
```
|
||||
|
||||
Where `CONFIG_HZ` can be one of the following values:
|
||||
|
||||
![HZ](http://s9.postimg.org/xy85r3jrj/image.png)
|
||||
|
||||
This means that in our case the timer interrupt frequency is `250 HZ`, i.e. it occurs `250` times per second, or one timer interrupt every `4ms`.
|
||||
|
||||
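To make this arithmetic concrete, here is a small user-space sketch (not kernel code; it only mirrors the `clocksource_cyc2ns` expression under the `HZ = 250` assumption from above) that converts one second's worth of jiffies to nanoseconds:

```C
#include <stdio.h>
#include <stdint.h>

#define HZ             250
#define NSEC_PER_SEC   1000000000L
#define NSEC_PER_JIFFY ((NSEC_PER_SEC + HZ/2) / HZ)  /* 4000000 ns with HZ=250 */
#define JIFFIES_SHIFT  8                             /* because HZ >= 67 */

int main(void)
{
	uint64_t mult   = (uint64_t)NSEC_PER_JIFFY << JIFFIES_SHIFT;
	uint64_t cycles = 250;  /* one second's worth of jiffies */

	/* the clocksource_cyc2ns expression: ((u64) cycles * mult) >> shift */
	printf("%llu ns\n", (unsigned long long)((cycles * mult) >> JIFFIES_SHIFT));
	return 0;
}
```

Running it prints `1000000000 ns`, i.e. `250` jiffies are converted back to exactly one second.
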
The last field that we can see in the definition of the `clocksource_jiffies` structure is `max_cycles`, which holds the maximum cycle value that can safely be multiplied without potentially causing an overflow.
|
||||
|
||||
Ok, we just saw the definition of the `clocksource_jiffies` structure, and we also know a little about `jiffies` and `clocksource`; now it is time to get back to the implementation of our function. In the beginning of this part we stopped at the call of the:
|
||||
|
||||
```C
|
||||
register_refined_jiffies(CLOCK_TICK_RATE);
|
||||
```
|
||||
|
||||
function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c#L842) source code file.
|
||||
|
||||
As I already wrote, the main purpose of the `register_refined_jiffies` function is to register the `refined_jiffies` clocksource. We have already seen that the `clocksource_jiffies` structure represents the standard `jiffies` clock source. Now, if you look in the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file, you will find yet another clock source definition:
|
||||
|
||||
```C
|
||||
struct clocksource refined_jiffies;
|
||||
```
|
||||
|
||||
There is one difference between `refined_jiffies` and `clocksource_jiffies`: the standard `jiffies` based clock source is the lowest common denominator clock source, which should function on all systems. As we already know, the `jiffies` global variable is incremented on each timer interrupt. This means that the standard `jiffies` based clock source has the same resolution as the timer interrupt frequency. From this we can understand that the standard `jiffies` based clock source may suffer from inaccuracies. The `refined_jiffies` clock source uses `CLOCK_TICK_RATE` as the base of the `jiffies` shift.
|
||||
|
||||
Let's look at the implementation of this function. First of all, we can see that the `refined_jiffies` clock source is based on the `clocksource_jiffies` structure:
|
||||
|
||||
```C
|
||||
int register_refined_jiffies(long cycles_per_second)
|
||||
{
|
||||
u64 nsec_per_tick, shift_hz;
|
||||
long cycles_per_tick;
|
||||
|
||||
refined_jiffies = clocksource_jiffies;
|
||||
refined_jiffies.name = "refined-jiffies";
|
||||
refined_jiffies.rating++;
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
Here we can see that we update the name of `refined_jiffies` to `refined-jiffies` and increase the rating of this structure. As you remember, `clocksource_jiffies` has rating `1`, so our `refined_jiffies` clocksource will have rating `2`. This means that the clock source management code will prefer `refined_jiffies`.
|
||||
|
||||
In the next step we need to calculate the number of cycles per tick:
|
||||
|
||||
```C
|
||||
cycles_per_tick = (cycles_per_second + HZ/2)/HZ;
|
||||
```
|
||||
|
||||
Note that we used the `NSEC_PER_SEC` macro as the base of the standard `jiffies` multiplier. Here we are using `cycles_per_second`, which is the first parameter of the `register_refined_jiffies` function. We passed the `CLOCK_TICK_RATE` macro to the `register_refined_jiffies` function. This macro is defined in the [arch/x86/include/asm/timex.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/timex.h) header file and expands to:
|
||||
|
||||
```C
|
||||
#define CLOCK_TICK_RATE PIT_TICK_RATE
|
||||
```
|
||||
|
||||
where the `PIT_TICK_RATE` macro expands to the frequency of the [Intel 8253](https://en.wikipedia.org/wiki/Intel_8253) [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer):
|
||||
|
||||
```C
|
||||
#define PIT_TICK_RATE 1193182ul
|
||||
```
|
||||
|
||||
After this we calculate `shift_hz` for the `register_refined_jiffies` function; it will store `hz << 8`, or in other words a shifted representation of the system timer frequency. We shift `cycles_per_second` (the frequency of the programmable interval timer) left by `8` bits in order to get extra accuracy:
|
||||
|
||||
```C
|
||||
shift_hz = (u64)cycles_per_second << 8;
|
||||
shift_hz += cycles_per_tick/2;
|
||||
do_div(shift_hz, cycles_per_tick);
|
||||
```
|
||||
|
||||
In the next step we calculate the number of nanoseconds per tick, shifting `NSEC_PER_SEC` left by `8` as we did with `shift_hz` and applying the same rounding before the division:
|
||||
|
||||
```C
|
||||
nsec_per_tick = (u64)NSEC_PER_SEC << 8;
|
||||
nsec_per_tick += (u32)shift_hz/2;
|
||||
do_div(nsec_per_tick, (u32)shift_hz);
|
||||
```

The result becomes the new `mult` value of the `refined_jiffies` clock source:

```C
|
||||
refined_jiffies.mult = ((u32)nsec_per_tick) << JIFFIES_SHIFT;
|
||||
```
|
||||
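If you want to check these calculations by hand, here is a quick back-of-the-envelope version in Python, assuming `HZ = 250` and `PIT_TICK_RATE = 1193182` and using integer division just like `do_div` does:

```python
>>> cycles_per_second = 1193182   # PIT_TICK_RATE
>>> HZ = 250
>>> cycles_per_tick = (cycles_per_second + HZ // 2) // HZ
>>> cycles_per_tick
4773
>>> shift_hz = ((cycles_per_second << 8) + cycles_per_tick // 2) // cycles_per_tick
>>> shift_hz
63996
>>> nsec_per_tick = ((10**9 << 8) + shift_hz // 2) // shift_hz
>>> nsec_per_tick
4000250
```

So the refined clock source accounts for the fact that the programmable interval timer cannot hit `250 Hz` exactly: one tick is treated as about `4000250` nanoseconds instead of the ideal `4000000`.
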
|
||||
At the end of the `register_refined_jiffies` function we register the new clock source with the `__clocksource_register` function, which is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h) header file, and return:
|
||||
|
||||
```C
|
||||
__clocksource_register(&refined_jiffies);
|
||||
return 0;
|
||||
```
|
||||
|
||||
The clock source management code provides an API for clock source registration and selection. As we can see, clock sources are registered by calling the `__clocksource_register` function during kernel initialization or from a kernel module. During registration, the clock source management code will choose the best clock source available in the system, using the `clocksource.rating` field which we already saw when we initialized the `clocksource` structure for `jiffies`.
|
||||
|
||||
Using the jiffies
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
We just saw initialization of two `jiffies` based clock sources in the previous paragraph:
|
||||
|
||||
* standard `jiffies` based clock source;
|
||||
* refined `jiffies` based clock source;
|
||||
|
||||
Don't worry if you don't understand the calculations here. They look frightening at first; soon, step by step, we will learn these things. So, we just saw the initialization of `jiffies` based clock sources, and we also know that the Linux kernel has the global variable `jiffies` that holds the number of ticks that have occurred since the kernel started to work. Now, let's look at how to use it. To use `jiffies` we can simply read the `jiffies` global variable by its name or call the `get_jiffies_64` function. This function is defined in the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file and just returns the full `64-bit` value of `jiffies`:
|
||||
|
||||
```C
|
||||
u64 get_jiffies_64(void)
|
||||
{
|
||||
unsigned long seq;
|
||||
u64 ret;
|
||||
|
||||
do {
|
||||
seq = read_seqbegin(&jiffies_lock);
|
||||
ret = jiffies_64;
|
||||
} while (read_seqretry(&jiffies_lock, seq));
|
||||
return ret;
|
||||
}
|
||||
EXPORT_SYMBOL(get_jiffies_64);
|
||||
```
|
||||
|
||||
Note that the `get_jiffies_64` function is not implemented like `jiffies_read`, for example:
|
||||
|
||||
```C
|
||||
static cycle_t jiffies_read(struct clocksource *cs)
|
||||
{
|
||||
return (cycle_t) jiffies;
|
||||
}
|
||||
```
|
||||
|
||||
We can see that the implementation of `get_jiffies_64` is more complex. The reading of the `jiffies_64` variable is implemented using [seqlocks](https://en.wikipedia.org/wiki/Seqlock). Actually, this is done for machines that cannot atomically read the full 64-bit value.
|
||||
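The writer side takes the same seqlock. Schematically (a heavily simplified sketch based on the `tick_do_update_jiffies64` function in [kernel/time/tick-sched.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-sched.c)), the update of `jiffies_64` looks like:

```C
write_seqlock(&jiffies_lock);
jiffies_64 += 1;                /* one more tick has elapsed */
write_sequnlock(&jiffies_lock);
```
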
|
||||
If we can access the `jiffies` or the `jiffies_64` variable, we can convert it to `human` time units. To get the number of elapsed seconds we can use the following expression:
|
||||
|
||||
```C
|
||||
jiffies / HZ
|
||||
```
|
||||
|
||||
So, knowing this, we can express any time interval in jiffies. For example:
|
||||
|
||||
```C
|
||||
/* Thirty seconds from now */
|
||||
jiffies + 30*HZ
|
||||
|
||||
/* Two minutes from now */
|
||||
jiffies + 120*HZ
|
||||
|
||||
/* One millisecond from now (exact only when HZ is a multiple of 1000) */
|
||||
jiffies + HZ / 1000
|
||||
```
|
||||
|
||||
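As a rough illustration of how such expressions are typically used in driver code (the function and the `device_is_ready` predicate here are hypothetical, but `msecs_to_jiffies`, `time_before` and `msleep` are real kernel helpers):

```C
#include <linux/jiffies.h>
#include <linux/delay.h>
#include <linux/errno.h>

/* hypothetical helper: poll some condition for up to 500 milliseconds */
static int wait_for_ready(void)
{
	unsigned long timeout = jiffies + msecs_to_jiffies(500);

	while (time_before(jiffies, timeout)) {
		if (device_is_ready())  /* hypothetical hardware check */
			return 0;
		msleep(10);             /* sleep 10 ms between polls */
	}

	return -ETIMEDOUT;
}
```
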
That's all.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This concludes the first part covering time and time management related concepts in the Linux kernel. We met the first two concepts and their initialization in this part: `jiffies` and `clocksource`. In the next part we will continue to dive into this interesting topic and, as I already wrote in this part, we will try to understand the insides of these and other time management concepts in the Linux kernel.
|
||||
|
||||
If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](mailto:anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).
|
||||
|
||||
**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [system call](https://en.wikipedia.org/wiki/System_call)
|
||||
* [TCP](https://en.wikipedia.org/wiki/Transmission_Control_Protocol)
|
||||
* [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
|
||||
* [cgroups](https://en.wikipedia.org/wiki/Cgroups)
|
||||
* [bss](https://en.wikipedia.org/wiki/.bss)
|
||||
* [initrd](https://en.wikipedia.org/wiki/Initrd)
|
||||
* [Intel MID](https://en.wikipedia.org/wiki/Mobile_Internet_device#Intel_MID_platforms)
|
||||
* [TSC](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
|
||||
* [void](https://en.wikipedia.org/wiki/Void_type)
|
||||
* [Simple Firmware Interface](https://en.wikipedia.org/wiki/Simple_Firmware_Interface)
|
||||
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
|
||||
* [real time clock](https://en.wikipedia.org/wiki/Real-time_clock)
|
||||
* [Jiffy](https://en.wikipedia.org/wiki/Jiffy_%28time%29)
|
||||
* [high precision event timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
|
||||
* [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond)
|
||||
* [Intel 8253](https://en.wikipedia.org/wiki/Intel_8253)
|
||||
* [seqlocks](https://en.wikipedia.org/wiki/Seqlock)
|
||||
* [clocksource documentation](https://www.kernel.org/doc/Documentation/timers/timekeeping.txt)
|
||||
* [Previous chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html)
|
@ -0,0 +1,451 @@
|
||||
Timers and time management in the Linux kernel. Part 2.
|
||||
================================================================================
|
||||
|
||||
Introduction to the `clocksource` framework
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) was the first part of the current [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html), which describes timers and time management related stuff in the Linux kernel. We got acquainted with two concepts in the previous part:
|
||||
|
||||
* `jiffies`
|
||||
* `clocksource`
|
||||
|
||||
The first is a global variable, defined in the [include/linux/jiffies.h](https://github.com/torvalds/linux/blob/master/include/linux/jiffies.h) header file, that represents a counter which is incremented during each timer interrupt. So, if we can access this global variable and we know the timer interrupt rate, we can convert `jiffies` to human time units. As we already know, the timer interrupt rate is represented by the compile-time constant called `HZ` in the Linux kernel. The value of `HZ` is equal to the value of the `CONFIG_HZ` kernel configuration option, and if we look into the [arch/x86/configs/x86_64_defconfig](https://github.com/torvalds/linux/blob/master/arch/x86/configs/x86_64_defconfig) kernel configuration file, we will see that the:
|
||||
|
||||
```
|
||||
CONFIG_HZ_1000=y
|
||||
```
|
||||
|
||||
kernel configuration option is set. This means that the value of `CONFIG_HZ` will be `1000` by default for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture. So, if we divide the value of `jiffies` by the value of `HZ`:
|
||||
|
||||
```
|
||||
jiffies / HZ
|
||||
```
|
||||
|
||||
we will get the number of seconds that have elapsed since the moment the Linux kernel started to work, or in other words the system [uptime](https://en.wikipedia.org/wiki/Uptime). Since `HZ` represents the number of timer interrupts in a second, we can set a value for some time in the future. For example:
|
||||
|
||||
```C
|
||||
/* one minute from now */
|
||||
unsigned long later = jiffies + 60*HZ;
|
||||
|
||||
/* five minutes from now */
|
||||
unsigned long later = jiffies + 5*60*HZ;
|
||||
```
|
||||
|
||||
This is a very common practice in the Linux kernel. For example, if you look into the [arch/x86/kernel/smpboot.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/smpboot.c) source code file, you will find the `do_boot_cpu` function, which boots all processors besides the bootstrap processor. In it you can find a snippet that waits ten seconds for a response from the application processor:
|
||||
|
||||
```C
|
||||
if (!boot_error) {
|
||||
timeout = jiffies + 10*HZ;
|
||||
while (time_before(jiffies, timeout)) {
|
||||
...
|
||||
...
|
||||
...
|
||||
udelay(100);
|
||||
}
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
We assign the `jiffies + 10*HZ` value to the `timeout` variable here. As I think you already understood, this means a ten second timeout. After this we enter a loop where we use the `time_before` macro to compare the current `jiffies` value with our timeout.
|
||||
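The `time_before` macro itself (together with its `time_after` counterpart) is defined in the [include/linux/jiffies.h](https://github.com/torvalds/linux/blob/master/include/linux/jiffies.h) header file and compares two jiffies values in a wraparound-safe way:

```C
#define time_after(a,b)		\
	(typecheck(unsigned long, a) && \
	 typecheck(unsigned long, b) && \
	 ((long)((b) - (a)) < 0))
#define time_before(a,b)	time_after(b,a)
```
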
|
||||
Or, for example, if we look into the [sound/isa/sscape.c](https://github.com/torvalds/linux/blob/master/sound/isa/sscape.c) source code file, which contains the driver for the [Ensoniq Soundscape Elite](https://en.wikipedia.org/wiki/Ensoniq_Soundscape_Elite) sound card, we will see the `obp_startup_ack` function that waits up to a given timeout for the On-Board Processor to return its start-up acknowledgement sequence:
|
||||
|
||||
```C
|
||||
static int obp_startup_ack(struct soundscape *s, unsigned timeout)
|
||||
{
|
||||
unsigned long end_time = jiffies + msecs_to_jiffies(timeout);
|
||||
|
||||
do {
|
||||
...
|
||||
...
|
||||
...
|
||||
x = host_read_unsafe(s->io_base);
|
||||
...
|
||||
...
|
||||
...
|
||||
if (x == 0xfe || x == 0xff)
|
||||
return 1;
|
||||
msleep(10);
|
||||
} while (time_before(jiffies, end_time));
|
||||
|
||||
return 0;
|
||||
}
|
||||
```
|
||||
|
||||
As you can see, the `jiffies` variable is very widely used in the Linux kernel [code](http://lxr.free-electrons.com/ident?i=jiffies). As I already wrote, we also met yet another new time management related concept in the previous part - `clocksource`. We have only seen a short description of this concept and the API for clock source registration. Let's take a closer look at it in this part.
|
||||
|
||||
Introduction to `clocksource`
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The `clocksource` concept represents the generic API for clock source management in the Linux kernel. Why do we need a separate framework for this? Let's go back to the beginning. The `time` concept is fundamental to the Linux kernel, as to any other operating system kernel, and timekeeping is one of the necessities for using this concept. For example, the Linux kernel must know and update the time elapsed since system startup, it must determine how long the current process has been running on every processor, and many many more. Where can the Linux kernel get information about time? First of all, there is the Real Time Clock or [RTC](https://en.wikipedia.org/wiki/Real-time_clock), which is backed by a nonvolatile device. You can find a set of architecture-independent real time clock drivers in the Linux kernel in the [drivers/rtc](https://github.com/torvalds/linux/tree/master/drivers/rtc) directory. Besides this, each architecture can provide a driver for an architecture-dependent real time clock, for example `CMOS/RTC` - [arch/x86/kernel/rtc.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/rtc.c) for the [x86](https://en.wikipedia.org/wiki/X86) architecture. The second source is the system timer - a timer that fires [interrupts](https://en.wikipedia.org/wiki/Interrupt) at a periodic rate. For example, for [IBM PC](https://en.wikipedia.org/wiki/IBM_Personal_Computer) compatibles it was the [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer).
|
||||
|
||||
We already know that for timekeeping purposes we can use `jiffies` in the Linux kernel. The `jiffies` can be considered a read-only global variable which is updated with `HZ` frequency. We know that `HZ` is a compile-time kernel parameter whose reasonable range is from `100` to `1000` [Hz](https://en.wikipedia.org/wiki/Hertz). So it is only guaranteed to provide an interface for time measurement with `1` - `10` millisecond resolution. Besides the standard `jiffies`, we saw the `refined_jiffies` clock source in the previous part, which is based on the `i8253/i8254` [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer) tick rate of almost `1193182` hertz. So with `refined_jiffies` we can get a resolution of about `1` microsecond. Nowadays, [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond) are the favorite choice for the time value units of a given clock source.
|
||||
|
||||
The availability of more precise techniques for time interval measurement is hardware-dependent. We just learned a little about `x86`-dependent timer hardware, but each architecture provides its own timer hardware, and earlier each architecture had its own implementation for managing it. The solution to this problem is an abstraction layer and associated API in a common code framework for managing various clock sources, independent of the timer interrupt. This common code framework became the `clocksource` framework.
|
||||
|
||||
The generic timeofday and clock source management framework moved a lot of timekeeping code into the architecture-independent portion of the code, with the architecture-dependent portion reduced to defining and managing low-level hardware pieces of clocksources. It takes a large amount of work to support time interval measurement on different architectures with different hardware, and it is very complex. The implementation of each clock related service is strongly associated with an individual hardware device and, as you can understand, this results in similar implementations across different architectures.
|
||||
|
||||
Within this framework, each clock source is required to maintain a representation of time as a monotonically increasing value. As we can see in the Linux kernel code, nanoseconds are the favorite choice for the time value units of a clock source at this time. One of the main points of the clock source framework is to allow a user to select a clock source among a range of available hardware devices supporting clock functions when configuring the system, and to provide selection, access and scaling of different clock sources.
|
||||
|
||||
The clocksource structure
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The foundation of the `clocksource` framework is the `clocksource` structure, defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h) header file. We already saw some fields provided by the `clocksource` structure in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html). Let's look at the full definition of this structure and try to describe all of its fields:
|
||||
|
||||
```C
|
||||
struct clocksource {
|
||||
cycle_t (*read)(struct clocksource *cs);
|
||||
cycle_t mask;
|
||||
u32 mult;
|
||||
u32 shift;
|
||||
u64 max_idle_ns;
|
||||
u32 maxadj;
|
||||
#ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
|
||||
struct arch_clocksource_data archdata;
|
||||
#endif
|
||||
u64 max_cycles;
|
||||
const char *name;
|
||||
struct list_head list;
|
||||
int rating;
|
||||
int (*enable)(struct clocksource *cs);
|
||||
void (*disable)(struct clocksource *cs);
|
||||
unsigned long flags;
|
||||
void (*suspend)(struct clocksource *cs);
|
||||
void (*resume)(struct clocksource *cs);
|
||||
#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
|
||||
struct list_head wd_list;
|
||||
cycle_t cs_last;
|
||||
cycle_t wd_last;
|
||||
#endif
|
||||
struct module *owner;
|
||||
} ____cacheline_aligned;
|
||||
```
|
||||
|
||||
We already saw the first field of the `clocksource` structure in the previous part - it is a pointer to the `read` function, which returns the current value of the counter managed by this clock source. For example, the `jiffies_read` function is used to read the `jiffies` value:
|
||||
|
||||
```C
|
||||
static struct clocksource clocksource_jiffies = {
|
||||
...
|
||||
.read = jiffies_read,
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
where `jiffies_read` just returns:
|
||||
|
||||
```C
|
||||
static cycle_t jiffies_read(struct clocksource *cs)
|
||||
{
|
||||
return (cycle_t) jiffies;
|
||||
}
|
||||
```
|
||||
|
||||
Or the `read_tsc` function:
|
||||
|
||||
```C
|
||||
static struct clocksource clocksource_tsc = {
|
||||
...
|
||||
.read = read_tsc,
|
||||
...
|
||||
};
|
||||
```
|
||||
|
||||
for reading the [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter).
|
||||
|
||||
The next field is `mask`, which ensures that subtraction between counter values from non-`64-bit` counters does not need special overflow logic (see the short sketch below). After the `mask` field, we can see two fields: `mult` and `shift`. These are the fields that are the base of the mathematical functions which provide the ability to convert time values specific to each clock source; in other words, these two fields help us to convert an abstract machine time unit of a counter to nanoseconds.
|
||||
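As a side note, the `mask` is what makes counter deltas overflow-safe: the difference between two reads of a counter is computed roughly like the kernel's `clocksource_delta` helper from [kernel/time/timekeeping_internal.h](https://github.com/torvalds/linux/blob/master/kernel/time/timekeeping_internal.h) does it:

```C
/* overflow-safe delta between two counter reads */
static inline cycle_t clocksource_delta(cycle_t now, cycle_t last, cycle_t mask)
{
	return (now - last) & mask;
}
```
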
|
||||
After these two fields we can see the `64-bit` `max_idle_ns` field, which represents the maximum idle time permitted by the clocksource, in nanoseconds. We need this field for a Linux kernel with the `CONFIG_NO_HZ` kernel configuration option enabled. This kernel configuration option allows the Linux kernel to run without a regular timer tick (we will see a full explanation of this in another part). The problem is that a dynamic tick allows the kernel to sleep for periods longer than a single tick; moreover, the sleep time could be unlimited. The `max_idle_ns` field represents this sleeping limit.
|
||||
|
||||
The next field after `max_idle_ns` is the `maxadj` field, which is the maximum adjustment value to `mult`. The main formula by which we convert cycles to nanoseconds:
|
||||
|
||||
```C
|
||||
((u64) cycles * mult) >> shift;
|
||||
```
|
||||
|
||||
is not `100%` accurate. Instead, the number is taken as close as possible to a nanosecond, and `maxadj` helps to correct this and allows the clocksource API to avoid `mult` values that might overflow when adjusted. The next four fields are pointers to functions:
|
||||
|
||||
* `enable` - optional function to enable clocksource;
|
||||
* `disable` - optional function to disable clocksource;
|
||||
* `suspend` - suspend function for the clocksource;
|
||||
* `resume` - resume function for the clocksource;
|
||||
|
||||
The next field is `max_cycles` and, as we can understand from its name, this field represents the maximum cycle value before a potential overflow. And the last field, `owner`, is a reference to the kernel [module](https://en.wikipedia.org/wiki/Loadable_kernel_module) that owns the clocksource. That is all. We just went through all the standard fields of the `clocksource` structure. But you may have noticed that we skipped some fields of the `clocksource` structure. We can divide all of the skipped fields into two types. Fields of the first type are already known to us - for example the `name` field, which represents the name of a `clocksource`, the `rating` field, which helps the Linux kernel to select the best clocksource, etc. The second type consists of fields which depend on different Linux kernel configuration options. Let's look at these fields.
|
||||
|
||||
The first field is `archdata`. This field has the `arch_clocksource_data` type and depends on the `CONFIG_ARCH_CLOCKSOURCE_DATA` kernel configuration option. This field is relevant only for the [x86](https://en.wikipedia.org/wiki/X86) and [IA64](https://en.wikipedia.org/wiki/IA-64) architectures at the moment. And again, as we can understand from the field's name, it represents architecture-specific data for a clock source. For example, it represents the `vDSO` clock mode:
|
||||
|
||||
```C
|
||||
struct arch_clocksource_data {
|
||||
int vclock_mode;
|
||||
};
|
||||
```
|
||||
|
||||
for the `x86` architectures, where the `vDSO` clock mode can be one of the following:
|
||||
|
||||
```C
|
||||
#define VCLOCK_NONE 0
|
||||
#define VCLOCK_TSC 1
|
||||
#define VCLOCK_HPET 2
|
||||
#define VCLOCK_PVCLOCK 3
|
||||
```
|
||||
|
||||
The last three fields, `wd_list`, `cs_last` and `wd_last`, depend on the `CONFIG_CLOCKSOURCE_WATCHDOG` kernel configuration option. First of all, let's try to understand what a `watchdog` is. In simple words, a watchdog is a timer that is used to detect computer malfunctions and recover from them. All three of these fields contain watchdog related data that is used by the `clocksource` framework. If we grep the Linux kernel source code, we will see that only the [arch/x86/KConfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig#L54) kernel configuration file contains the `CONFIG_CLOCKSOURCE_WATCHDOG` kernel configuration option. So, why do `x86` and `x86_64` need a [watchdog](https://en.wikipedia.org/wiki/Watchdog_timer)? You may already know that all `x86` processors have a special 64-bit register - the [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter). This register contains the number of [cycles](https://en.wikipedia.org/wiki/Clock_rate) since reset. Sometimes the time stamp counter needs to be verified against another clock source. We will not see the initialization of the `watchdog` timer in this part; before that we must learn more about timers.
|
||||
|
||||
That's all. From this moment we know all the fields of the `clocksource` structure. This knowledge will help us to learn the internals of the `clocksource` framework.
|
||||
|
||||
New clock source registration
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
We saw only one function of the `clocksource` framework in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) - `__clocksource_register`. This function is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/tree/master/include/linux/clocksource.h) header file and, as we can understand from the function's name, its main point is to register a new clocksource. If we look at the implementation of the `__clocksource_register` function, we will see that it just calls the `__clocksource_register_scale` function and returns its result:
|
||||
|
||||
```C
|
||||
static inline int __clocksource_register(struct clocksource *cs)
|
||||
{
|
||||
return __clocksource_register_scale(cs, 1, 0);
|
||||
}
|
||||
```
|
||||
|
||||
Before we look at the implementation of the `__clocksource_register_scale` function, note that `clocksource` provides an additional API for new clock source registration:
|
||||
|
||||
```C
|
||||
static inline int clocksource_register_hz(struct clocksource *cs, u32 hz)
|
||||
{
|
||||
return __clocksource_register_scale(cs, 1, hz);
|
||||
}
|
||||
|
||||
static inline int clocksource_register_khz(struct clocksource *cs, u32 khz)
|
||||
{
|
||||
return __clocksource_register_scale(cs, 1000, khz);
|
||||
}
|
||||
```
|
||||
|
||||
All of these functions do the same thing: they return the value of the `__clocksource_register_scale` function, but with different sets of parameters. The `__clocksource_register_scale` function is defined in the [kernel/time/clocksource.c](https://github.com/torvalds/linux/tree/master/kernel/time/clocksource.c) source code file. To understand the difference between these functions, let's look at the parameters of the `__clocksource_register_scale` function. As we can see, it takes three parameters:
|
||||
|
||||
* `cs` - clocksource to be installed;
|
||||
* `scale` - scale factor of a clock source; in other words, if we multiply the value of this parameter by the frequency, we get the `hz` of a clocksource;
|
||||
* `freq` - clock source frequency divided by scale.
|
||||
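For example, the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) driver on `x86` registers its clock source with a known frequency in hertz; a sketch based on [arch/x86/kernel/hpet.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/hpet.c), where `hpet_freq` is the counter frequency computed from the hardware:

```C
clocksource_register_hz(&clocksource_hpet, (u32)hpet_freq);
```
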
|
||||
Now let's look at the implementation of the `__clocksource_register_scale` function:
|
||||
|
||||
```C
|
||||
int __clocksource_register_scale(struct clocksource *cs, u32 scale, u32 freq)
|
||||
{
|
||||
__clocksource_update_freq_scale(cs, scale, freq);
|
||||
mutex_lock(&clocksource_mutex);
|
||||
clocksource_enqueue(cs);
|
||||
clocksource_enqueue_watchdog(cs);
|
||||
clocksource_select();
|
||||
mutex_unlock(&clocksource_mutex);
|
||||
return 0;
|
||||
}
|
||||
```
|
||||
|
||||
First of all we can see that the `__clocksource_register_scale` function starts with a call of the `__clocksource_update_freq_scale` function, which is defined in the same source code file and updates the given clock source with the new frequency. Let's look at the implementation of this function. In the first step we need to check the given frequency: if it was not passed as `zero`, we need to calculate the `mult` and `shift` parameters for the given clock source. Why do we need to check the value of the `frequency`? Actually, it can be zero. If you looked carefully at the implementation of the `__clocksource_register` function, you may have noticed that we passed the `frequency` as `0`. We do this only for some clock sources that have self-defined `mult` and `shift` parameters. Look at the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) and you will see that we saw the calculation of `mult` and `shift` for `jiffies` there. The `__clocksource_update_freq_scale` function will do this for us for other clock sources.
|
||||
|
||||
So, at the start of the `__clocksource_update_freq_scale` function, we check the value of the `frequency` parameter, and if it is not zero we calculate `mult` and `shift` for the given clock source. Let's look at the `mult` and `shift` calculation:
|
||||
|
||||
```C
|
||||
void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq)
|
||||
{
|
||||
u64 sec;
|
||||
|
||||
if (freq) {
|
||||
sec = cs->mask;
|
||||
do_div(sec, freq);
|
||||
do_div(sec, scale);
|
||||
|
||||
if (!sec)
|
||||
sec = 1;
|
||||
else if (sec > 600 && cs->mask > UINT_MAX)
|
||||
sec = 600;
|
||||
|
||||
clocks_calc_mult_shift(&cs->mult, &cs->shift, freq,
|
||||
NSEC_PER_SEC / scale, sec * scale);
|
||||
}
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
Here we can see the calculation of the maximum number of seconds which we can run before a clock source counter overflows. First of all we fill the `sec` variable with the value of the clock source mask. Remember that a clock source's mask represents the maximum number of bits that are valid for the given clock source. After this, we can see two division operations: first we divide our `sec` variable by the clock source frequency and then by the scale factor. The `freq` parameter shows us how many counter cycles occur in one second. So, we divide the `mask` value, which represents the maximum value of the counter (for example of the `jiffy` counter), by the frequency of the counter and get the maximum number of seconds for the given clock source. The second division operation adjusts this for the scale factor, which can be `1` hertz or `1` kilohertz (`10^3` Hz).
|
||||
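For instance, assuming a `64-bit` [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) ticking at `2.4 GHz` (both numbers are just an illustration), the raw result is huge and will be clamped in the next step:

```python
>>> mask = 0xffffffffffffffff
>>> freq = 2400000000          # an assumed 2.4 GHz time stamp counter
>>> scale = 1
>>> mask // freq // scale      # maximum seconds before the counter overflows
7686143364
```
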
|
||||
After we have got the maximum number of seconds, we clamp this value at the next step: it is set to at least `1` second and, for wide counters, to at most `600` seconds. This value is the maximum sleeping time for a clocksource, in seconds. Next we can see the call of `clocks_calc_mult_shift`. The main point of this function is the calculation of the `mult` and `shift` values for the given clock source. At the end of the `__clocksource_update_freq_scale` function we check that the just calculated `mult` value of the given clock source will not cause an overflow after adjustment, update the `max_idle_ns` and `max_cycles` values of the given clock source with the maximum nanoseconds that can be converted to a clock source counter, and print the result to the kernel buffer:
|
||||
|
||||
```C
|
||||
pr_info("%s: mask: 0x%llx max_cycles: 0x%llx, max_idle_ns: %lld ns\n",
|
||||
cs->name, cs->mask, cs->max_cycles, cs->max_idle_ns);
|
||||
```
|
||||
|
||||
that we can see in the [dmesg](https://en.wikipedia.org/wiki/Dmesg) output:
|
||||
|
||||
```
|
||||
$ dmesg | grep "clocksource:"
|
||||
[ 0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
|
||||
[ 0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
|
||||
[ 0.094084] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
|
||||
[ 0.205302] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
|
||||
[ 1.452979] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x7350b459580, max_idle_ns: 881591204237 ns
|
||||
```
|
||||
|
||||
After the `__clocksource_update_freq_scale` function finishes its work, we return back to the `__clocksource_register_scale` function, which registers the new clock source. We can see the calls of the following three functions:
|
||||
|
||||
```C
|
||||
mutex_lock(&clocksource_mutex);
|
||||
clocksource_enqueue(cs);
|
||||
clocksource_enqueue_watchdog(cs);
|
||||
clocksource_select();
|
||||
mutex_unlock(&clocksource_mutex);
|
||||
```
|
||||
|
||||
Note that before the first of them is called, we lock the `clocksource_mutex` [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion). The point of the `clocksource_mutex` mutex is to protect the `curr_clocksource` variable, which represents the currently selected `clocksource`, and the `clocksource_list` variable, which represents the list that contains the registered `clocksources`. Now, let's look at these three functions.
|
||||
|
||||
The first, `clocksource_enqueue`, and the other two are defined in the same source code [file](https://github.com/torvalds/linux/tree/master/kernel/time/clocksource.c). It goes through all already registered `clocksources`, or in other words through all elements of the `clocksource_list`, and tries to find the best place for the given `clocksource`:
|
||||
|
||||
```C
|
||||
static void clocksource_enqueue(struct clocksource *cs)
|
||||
{
|
||||
struct list_head *entry = &clocksource_list;
|
||||
struct clocksource *tmp;
|
||||
|
||||
list_for_each_entry(tmp, &clocksource_list, list)
|
||||
if (tmp->rating >= cs->rating)
|
||||
entry = &tmp->list;
|
||||
list_add(&cs->list, entry);
|
||||
}
|
||||
```
|
||||
|
||||
In the end we just insert the new clocksource into the `clocksource_list`. The second function, `clocksource_enqueue_watchdog`, does almost the same as the previous function, but it inserts the new clock source into the `wd_list`, depending on the flags of the clock source, and starts a new [watchdog](https://en.wikipedia.org/wiki/Watchdog_timer) timer. As I already wrote, we will not consider `watchdog` related stuff in this part, but will do so in the next parts.
|
||||
|
||||
The last function is `clocksource_select`. As we can understand from the function's name, the main point of this function is to select the best `clocksource` from the registered clocksources. This function consists only of a call to a helper function:
|
||||
|
||||
```C
|
||||
static void clocksource_select(void)
|
||||
{
|
||||
return __clocksource_select(false);
|
||||
}
|
||||
```
|
||||
|
||||
Note that the `__clocksource_select` function takes one parameter (`false` in our case). This [bool](https://en.wikipedia.org/wiki/Boolean_data_type) parameter shows how to traverse the `clocksource_list`. In our case we pass `false`, which means that we will go through all entries of the `clocksource_list`. We already know that the `clocksource` with the best rating will be the first in the `clocksource_list` after the call of the `clocksource_enqueue` function, so we can easily get it from this list. After we find the clock source with the best rating, we switch to it:
|
||||
|
||||
```C
|
||||
if (curr_clocksource != best && !timekeeping_notify(best)) {
|
||||
pr_info("Switched to clocksource %s\n", best->name);
|
||||
curr_clocksource = best;
|
||||
}
|
||||
```
|
||||
|
||||
We can see the result of this operation in the `dmesg` output:
|
||||
|
||||
```
|
||||
$ dmesg | grep Switched
|
||||
[ 0.199688] clocksource: Switched to clocksource hpet
|
||||
[ 2.452966] clocksource: Switched to clocksource tsc
|
||||
```
|
||||
|
||||
Note that we can see two clock sources in the `dmesg` output (`hpet` and `tsc` in our case). Yes, there can actually be many different clock sources on particular hardware. So the Linux kernel knows about all the registered clock sources and switches to the clock source with a better rating each time a new clock source is registered.
|
||||
|
||||
If we look at the bottom of the [kernel/time/clocksource.c](https://github.com/torvalds/linux/tree/master/kernel/time/clocksource.c) source code file, we will see that it has a [sysfs](https://en.wikipedia.org/wiki/Sysfs) interface. The main initialization occurs in the `init_clocksource_sysfs` function, which is called during the device `initcalls`. Let's look at the implementation of the `init_clocksource_sysfs` function:
|
||||
|
||||
```C
|
||||
static struct bus_type clocksource_subsys = {
|
||||
.name = "clocksource",
|
||||
.dev_name = "clocksource",
|
||||
};
|
||||
|
||||
static int __init init_clocksource_sysfs(void)
|
||||
{
|
||||
int error = subsys_system_register(&clocksource_subsys, NULL);
|
||||
|
||||
if (!error)
|
||||
error = device_register(&device_clocksource);
|
||||
if (!error)
|
||||
error = device_create_file(
|
||||
&device_clocksource,
|
||||
&dev_attr_current_clocksource);
|
||||
if (!error)
|
||||
error = device_create_file(&device_clocksource,
|
||||
&dev_attr_unbind_clocksource);
|
||||
if (!error)
|
||||
error = device_create_file(
|
||||
&device_clocksource,
|
||||
&dev_attr_available_clocksource);
|
||||
return error;
|
||||
}
|
||||
device_initcall(init_clocksource_sysfs);
|
||||
```
|
||||
|
||||
First of all we can see that it registers a `clocksource` subsystem with the call of the `subsys_system_register` function. In other words, after the call of this function, we will have the following directory:
|
||||
|
||||
```
|
||||
$ pwd
|
||||
/sys/devices/system/clocksource
|
||||
```
|
||||
|
||||
After this step, we can see the registration of the `device_clocksource` device, which is represented by the following structure:
|
||||
|
||||
```C
|
||||
static struct device device_clocksource = {
|
||||
.id = 0,
|
||||
.bus = &clocksource_subsys,
|
||||
};
|
||||
```
|
||||
|
||||
and the creation of three files:
|
||||
|
||||
* `dev_attr_current_clocksource`;
|
||||
* `dev_attr_unbind_clocksource`;
|
||||
* `dev_attr_available_clocksource`.
|
||||
|
||||
These files provide information about the current clock source in the system and the available clock sources in the system, as well as an interface which allows unbinding a clock source.
|
||||
|
||||
After the `init_clocksource_sysfs` function has been executed, we will be able to find information about the available clock sources in:
|
||||
|
||||
```
|
||||
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
|
||||
tsc hpet acpi_pm
|
||||
```
|
||||
|
||||
Or, for example, information about the current clock source in the system:
|
||||
|
||||
```
|
||||
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
|
||||
tsc
|
||||
```
|
||||
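Note that the `current_clocksource` file is not only readable but also writable, so with root privileges we can ask the kernel to switch to another available clock source:

```
$ echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
```
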
|
||||
In the previous part, we saw the API for the registration of the `jiffies` clock source, but we didn't dive into the details of the `clocksource` framework. In this part we did, and saw the implementation of new clock source registration and the selection of the clock source with the best rating value in the system. Of course, this is not all of the API that the `clocksource` framework provides. There are a couple of additional functions, like `clocksource_unregister` for removing a given clock source from the `clocksource_list`, etc. But I will not describe these functions in this part, because they are not important for us right now. Anyway, if you are interested, you can find them in [kernel/time/clocksource.c](https://github.com/torvalds/linux/tree/master/kernel/time/clocksource.c).
|
||||
|
||||
That's all.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the second part of the chapter that describes timers and time management related stuff in the Linux kernel. In the previous part we got acquainted with the following two concepts: `jiffies` and `clocksource`. In this part we saw some examples of `jiffies` usage and learned more details about the `clocksource` concept.
|
||||
|
||||
If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](mailto:anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).
|
||||
|
||||
**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
|
||||
|
||||
Links
|
||||
-------------------------------------------------------------------------------
|
||||
|
||||
* [x86](https://en.wikipedia.org/wiki/X86)
|
||||
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
|
||||
* [uptime](https://en.wikipedia.org/wiki/Uptime)
|
||||
* [Ensoniq Soundscape Elite](https://en.wikipedia.org/wiki/Ensoniq_Soundscape_Elite)
|
||||
* [RTC](https://en.wikipedia.org/wiki/Real-time_clock)
|
||||
* [interrupts](https://en.wikipedia.org/wiki/Interrupt)
|
||||
* [IBM PC](https://en.wikipedia.org/wiki/IBM_Personal_Computer)
|
||||
* [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer)
|
||||
* [Hz](https://en.wikipedia.org/wiki/Hertz)
|
||||
* [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond)
|
||||
* [dmesg](https://en.wikipedia.org/wiki/Dmesg)
|
||||
* [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
|
||||
* [loadable kernel module](https://en.wikipedia.org/wiki/Loadable_kernel_module)
|
||||
* [IA64](https://en.wikipedia.org/wiki/IA-64)
|
||||
* [watchdog](https://en.wikipedia.org/wiki/Watchdog_timer)
|
||||
* [clock rate](https://en.wikipedia.org/wiki/Clock_rate)
|
||||
* [mutex](https://en.wikipedia.org/wiki/Mutual_exclusion)
|
||||
* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
|
||||
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html)
|
@ -0,0 +1,444 @@
|
||||
Timers and time management in the Linux kernel. Part 3.
================================================================================

The tick broadcast framework and dyntick
--------------------------------------------------------------------------------

This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. We stopped at the `clocksource` framework in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html). We started to consider this framework because it is closely related to the special counters which are provided by the Linux kernel. One of these counters, which we already saw in the first [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) of this chapter, is `jiffies`. As I already wrote in the first part of this chapter, we will consider time management related stuff step by step during the Linux kernel initialization. The previous step was the call of the:

```C
register_refined_jiffies(CLOCK_TICK_RATE);
```

function which is defined in the [kernel/time/jiffies.c](https://github.com/torvalds/linux/blob/master/kernel/time/jiffies.c) source code file and executes initialization of the `refined_jiffies` clock source for us. Recall that this function is called from the `setup_arch` function that is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and executes architecture-specific ([x86_64](https://en.wikipedia.org/wiki/X86-64) in our case) initialization. If you look at the implementation of `setup_arch`, you will note that the call of `register_refined_jiffies` is the last step before the `setup_arch` function finishes its work.

There are many different `x86_64` specific things already configured by the end of the `setup_arch` execution. For example, some early [interrupt](https://en.wikipedia.org/wiki/Interrupt) handlers are already able to handle interrupts, memory space is reserved for the [initrd](https://en.wikipedia.org/wiki/Initrd), [DMI](https://en.wikipedia.org/wiki/Desktop_Management_Interface) is scanned, the Linux kernel log buffer is already set up which means that the [printk](https://en.wikipedia.org/wiki/Printk) function is able to work, [e820](https://en.wikipedia.org/wiki/E820) is parsed and the Linux kernel already knows about the available memory, and many, many other architecture-specific things are done (if you are interested, you can read more about the `setup_arch` function and the Linux kernel initialization process in the second [chapter](https://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) of this book).

Now that `setup_arch` has finished its work, we can return to the generic Linux kernel code. Recall that the `setup_arch` function was called from the `start_kernel` function which is defined in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file. So, we shall return to this function. You can see that many different functions are called right after `setup_arch` inside of the `start_kernel` function, but since our chapter is devoted to timers and time management related stuff, we will skip all code which is not related to this topic. The first function in `start_kernel` which is related to time management in the Linux kernel is:

```C
tick_init();
```

The `tick_init` function is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and does two things:

* Initialization of `tick broadcast` framework related data structures;
* Initialization of `full` tickless mode related data structures.

We haven't seen anything related to the `tick broadcast` framework in this book yet, and we don't know anything about tickless mode in the Linux kernel. So, the main point of this part is to look at these concepts and learn what they are.

The idle process
--------------------------------------------------------------------------------

First of all, let's look at the implementation of the `tick_init` function. As I already wrote, this function is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and consists of two calls of the following functions:

```C
void __init tick_init(void)
{
    tick_broadcast_init();
    tick_nohz_init();
}
```

As you can understand from the paragraph's title, we are interested only in the `tick_broadcast_init` function for now. This function is defined in the [kernel/time/tick-broadcast.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-broadcast.c) source code file and initializes the `tick broadcast` framework related data structures. Before we look at the implementation of the `tick_broadcast_init` function and try to understand what this function does, we need to learn about the `tick broadcast` framework itself.

The main job of a central processor is to execute programs. But sometimes a processor may be in a special state when it is not being used by any program. This special state is called [idle](https://en.wikipedia.org/wiki/Idle_%28CPU%29). When the processor has nothing to execute, the Linux kernel launches the `idle` task. We already saw a little about this in the last part of the [Linux kernel initialization process](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-10.html). When the Linux kernel finishes all initialization in the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file, it calls the `rest_init` function from the same source code file. The main job of this function is to launch the kernel `init` thread and the `kthreadd` thread, to call the `schedule` function to start task scheduling, and to go to sleep by calling the `cpu_idle_loop` function that is defined in the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/master/kernel/sched/idle.c) source code file.

The `cpu_idle_loop` function represents an infinite loop which checks the need for rescheduling on each iteration. After the scheduler finds something to execute, the `idle` process finishes its work and control is passed to the new runnable task with the call of the `schedule_preempt_disabled` function:

```C
static void cpu_idle_loop(void)
{
    while (1) {
        while (!need_resched()) {
            ...
            ...
            ...
            /* the main idle function */
            cpuidle_idle_call();
        }
        ...
        ...
        ...
        schedule_preempt_disabled();
    }
}
```

Of course, we will not consider the full implementation of the `cpu_idle_loop` function and the details of the `idle` state in this part, because it is not related to our topic. But there is one interesting moment for us. We know that a processor can execute only one task at a time. How does the Linux kernel decide to reschedule and stop the `idle` process if the processor is executing the infinite loop in `cpu_idle_loop`? The answer is system timer interrupts. When an interrupt occurs, the processor stops the `idle` thread and transfers control to an interrupt handler. After the system timer interrupt has been handled, `need_resched` will return true and the Linux kernel will stop the `idle` process and transfer control to the current runnable task. But handling system timer interrupts is not effective for [power management](https://en.wikipedia.org/wiki/Power_management), because if a processor is in the `idle` state, there is little point in sending it a system timer interrupt.

By default, the `CONFIG_HZ_PERIODIC` kernel configuration option is enabled in the Linux kernel and tells the kernel to handle each interrupt of the system timer. To solve this problem, the Linux kernel provides two additional ways of managing scheduling-clock interrupts:

The first is to omit scheduling-clock ticks on idle processors. To enable this behaviour in the Linux kernel, we need to enable the `CONFIG_NO_HZ_IDLE` kernel configuration option. This option allows the Linux kernel to avoid sending timer interrupts to idle processors. In this case periodic timer interrupts are replaced with on-demand interrupts. This mode is called `dyntick-idle` mode. But if the kernel does not handle interrupts of a system timer, how can it decide whether the system has nothing to do?

Whenever the idle task is selected to run, the periodic tick is disabled with the call of the `tick_nohz_idle_enter` function that is defined in the [kernel/time/tick-sched.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-sched.c) source code file and enabled with the call of the `tick_nohz_idle_exit` function. There is a special concept in the Linux kernel called `clock event devices` that are used to schedule the next interrupt. This concept provides an API for devices which can deliver interrupts at a specific time in the future and is represented by the `clock_event_device` structure in the Linux kernel. We will not dive into the implementation of the `clock_event_device` structure now; we will see it in the next part of this chapter. But there is one interesting moment for us right now.

The second way is to omit scheduling-clock ticks on processors that are either in the `idle` state or have only one runnable task, in other words on busy processors too. We can enable this feature with the `CONFIG_NO_HZ_FULL` kernel configuration option, and it allows us to reduce the number of timer interrupts significantly.
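
With `CONFIG_NO_HZ_FULL` enabled, the set of full-dynticks processors is usually chosen at boot time via the `nohz_full` kernel command line parameter described in the [NO_HZ documentation](https://github.com/torvalds/linux/blob/master/Documentation/timers/NO_HZ.txt). For example (the exact CPU list here is just an illustration):

```
nohz_full=1-3
```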

Besides the `cpu_idle_loop`, an idle processor can be in a sleeping state. The Linux kernel provides the special `cpuidle` framework, whose main job is to put an idle processor into sleeping states. The set of these states is called `C-states`. But how will a processor be woken up if its local timer is disabled? The Linux kernel provides the `tick broadcast` framework for this. The main job of this framework is to assign a timer which is not affected by the `C-states`; this timer will wake a sleeping processor.

Now, after some theory, we can return to the implementation of our function. Let's recall that the `tick_init` function just calls the two following functions:

```C
void __init tick_init(void)
{
    tick_broadcast_init();
    tick_nohz_init();
}
```

Let's consider the first one. The `tick_broadcast_init` function is defined in the [kernel/time/tick-broadcast.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-broadcast.c) source code file and initializes the `tick broadcast` framework related data structures. Let's look at the implementation of the `tick_broadcast_init` function:

```C
void __init tick_broadcast_init(void)
{
    zalloc_cpumask_var(&tick_broadcast_mask, GFP_NOWAIT);
    zalloc_cpumask_var(&tick_broadcast_on, GFP_NOWAIT);
    zalloc_cpumask_var(&tmpmask, GFP_NOWAIT);
#ifdef CONFIG_TICK_ONESHOT
    zalloc_cpumask_var(&tick_broadcast_oneshot_mask, GFP_NOWAIT);
    zalloc_cpumask_var(&tick_broadcast_pending_mask, GFP_NOWAIT);
    zalloc_cpumask_var(&tick_broadcast_force_mask, GFP_NOWAIT);
#endif
}
```

As we can see, the `tick_broadcast_init` function allocates different [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) with the help of the `zalloc_cpumask_var` function. The `zalloc_cpumask_var` function is defined in the [lib/cpumask.c](https://github.com/torvalds/linux/blob/master/lib/cpumask.c) source code file and expands to the call of the following function:

```C
bool zalloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
{
    return alloc_cpumask_var(mask, flags | __GFP_ZERO);
}
```

Ultimately, the memory space for the given `cpumask` will be allocated with the given flags with the help of the `kmalloc_node` function:

```C
*mask = kmalloc_node(cpumask_size(), flags, node);
```
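
As a quick illustration of this allocation pattern (a hypothetical snippet, not kernel code), a caller allocates a zero-filled `cpumask`, manipulates individual bits and frees it afterwards:

```C
/* Hypothetical example of the cpumask API used above. */
cpumask_var_t mask;

if (!zalloc_cpumask_var(&mask, GFP_KERNEL))  /* allocate a zero-filled mask */
    return -ENOMEM;

cpumask_set_cpu(0, mask);                    /* mark CPU 0 in the bitmap */
if (cpumask_test_cpu(0, mask))
    pr_info("CPU 0 is set\n");

free_cpumask_var(mask);
```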

Now let's look at the `cpumasks` that are initialized in the `tick_broadcast_init` function. As we can see, the `tick_broadcast_init` function initializes six `cpumasks`, and the initialization of the last three of them depends on the `CONFIG_TICK_ONESHOT` kernel configuration option.

The first three `cpumasks` are:

* `tick_broadcast_mask` - the bitmap which represents the list of processors that are currently in sleeping mode;
* `tick_broadcast_on` - the bitmap that stores the numbers of processors which are in periodic broadcast state;
* `tmpmask` - a bitmap for temporary usage.

As we already know, the next three `cpumasks` depend on the `CONFIG_TICK_ONESHOT` kernel configuration option. Actually, each clock event device can be in one of two modes:

* `periodic` - clock event devices that support periodic events;
* `oneshot` - clock event devices that are capable of issuing events that happen only once.

The Linux kernel defines two feature masks for such clock event devices in the [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h) header file:

```C
#define CLOCK_EVT_FEAT_PERIODIC        0x000001
#define CLOCK_EVT_FEAT_ONESHOT         0x000002
```

So, the last three `cpumasks` are:

* `tick_broadcast_oneshot_mask` - stores the numbers of processors that must be notified;
* `tick_broadcast_pending_mask` - stores the numbers of processors with a pending broadcast;
* `tick_broadcast_force_mask` - stores the numbers of processors with an enforced broadcast.

We have initialized six `cpumasks` in the `tick broadcast` framework, and now we can proceed to the implementation of this framework.

The `tick broadcast` framework
--------------------------------------------------------------------------------

Hardware may provide several clock source devices. When a processor sleeps and its local timer is stopped, there must be an additional clock source device that will handle the awakening of the processor. The Linux kernel uses these `special` clock source devices which can raise an interrupt at a specified time. We already know that such timers are called `clock events` devices in the Linux kernel. Each processor in the system has its own local timer which is programmed to issue an interrupt at the time of the next deferred task. These timers can also be programmed to do a periodic job, like updating `jiffies`, etc. Such a timer is represented by the `tick_device` structure in the Linux kernel. This structure is defined in the [kernel/time/tick-sched.h](https://github.com/torvalds/linux/blob/master/kernel/time/tick-sched.h) header file and looks like:

```C
struct tick_device {
    struct clock_event_device *evtdev;
    enum tick_device_mode mode;
};
```

Note that the `tick_device` structure contains two fields. The first field, `evtdev`, is a pointer to the `clock_event_device` structure that is defined in the [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h) header file and represents the descriptor of a clock event device. A `clock event` device allows us to register an event that will happen in the future. As I already wrote, we will not consider the `clock_event_device` structure and the related API in this part, but will see it in the next part.

The second field of the `tick_device` structure represents the mode of the `tick_device`. As we already know, the mode can be one of the:

```C
enum tick_device_mode {
    TICKDEV_MODE_PERIODIC,
    TICKDEV_MODE_ONESHOT,
};
```

Each `clock events` device in the system registers itself by the call of the `clockevents_register_device` function or the `clockevents_config_and_register` function during the initialization process of the Linux kernel. During the registration of a new `clock events` device, the Linux kernel calls the `tick_check_new_device` function that is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and checks whether the given `clock events` device should be used by the Linux kernel. After all the checks, the `tick_check_new_device` function executes a call of the:

```C
tick_install_broadcast_device(newdev);
```

function that checks whether the given `clock event` device can be the broadcast device and installs it if it can. Let's look at the implementation of the `tick_install_broadcast_device` function:

```C
void tick_install_broadcast_device(struct clock_event_device *dev)
{
    struct clock_event_device *cur = tick_broadcast_device.evtdev;

    if (!tick_check_broadcast_device(cur, dev))
        return;

    if (!try_module_get(dev->owner))
        return;

    clockevents_exchange_device(cur, dev);

    if (cur)
        cur->event_handler = clockevents_handle_noop;

    tick_broadcast_device.evtdev = dev;

    if (!cpumask_empty(tick_broadcast_mask))
        tick_broadcast_start_periodic(dev);

    if (dev->features & CLOCK_EVT_FEAT_ONESHOT)
        tick_clock_notify();
}
```

First of all, we get the current `clock event` device from the `tick_broadcast_device`. The `tick_broadcast_device` is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file:

```C
static struct tick_device tick_broadcast_device;
```

and represents the external clock device that keeps track of events for a processor. The first step after we have got the current clock device is the call of the `tick_check_broadcast_device` function, which checks that the given clock events device can be utilized as a broadcast device. The main point of the `tick_check_broadcast_device` function is to check the value of the `features` field of the given `clock events` device. As we can understand from the name of this field, the `features` field contains the features of a clock event device. The available values are defined in the [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h) header file, for example `CLOCK_EVT_FEAT_PERIODIC`, which represents a clock events device that supports periodic events, and so on. The `tick_check_broadcast_device` function checks the `features` flags for `CLOCK_EVT_FEAT_ONESHOT`, `CLOCK_EVT_FEAT_DUMMY` and other flags and returns `false` if the given clock events device has one of these features. Otherwise, the `tick_check_broadcast_device` function compares the `ratings` of the given clock event device and the current clock event device and returns the best one.

After the `tick_check_broadcast_device` function, we can see the call of the `try_module_get` function that checks the module owner of the clock events device. We need to do this to be sure that the given `clock events` device was correctly initialized. The next step is the call of the `clockevents_exchange_device` function that is defined in the [kernel/time/clockevents.c](https://github.com/torvalds/linux/blob/master/kernel/time/clockevents.c) source code file; it releases the old clock events device and replaces the previous functional handler with a dummy handler.

In the last step of the `tick_install_broadcast_device` function, we check that the `tick_broadcast_mask` is not empty and start the given `clock events` device in periodic mode with the call of the `tick_broadcast_start_periodic` function:

```C
if (!cpumask_empty(tick_broadcast_mask))
    tick_broadcast_start_periodic(dev);

if (dev->features & CLOCK_EVT_FEAT_ONESHOT)
    tick_clock_notify();
```

The `tick_broadcast_mask` is filled in the `tick_device_uses_broadcast` function that checks a `clock events` device during its registration:

```C
int cpu = smp_processor_id();

int tick_device_uses_broadcast(struct clock_event_device *dev, int cpu)
{
    ...
    ...
    ...
    if (!tick_device_is_functional(dev)) {
        ...
        cpumask_set_cpu(cpu, tick_broadcast_mask);
        ...
    }
    ...
    ...
    ...
}
```

You can read more about the `smp_processor_id` macro in the fourth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process chapter.

The `tick_broadcast_start_periodic` function checks the given `clock event` device and calls the `tick_setup_periodic` function:

```C
static void tick_broadcast_start_periodic(struct clock_event_device *bc)
{
    if (bc)
        tick_setup_periodic(bc, 1);
}
```

which is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and sets the broadcast handler for the given `clock event` device by the call of the following function:

```C
tick_set_periodic_handler(dev, broadcast);
```

This function checks the second parameter, which represents the broadcast state (`on` or `off`), and sets the broadcast handler depending on its value:

```C
void tick_set_periodic_handler(struct clock_event_device *dev, int broadcast)
{
    if (!broadcast)
        dev->event_handler = tick_handle_periodic;
    else
        dev->event_handler = tick_handle_periodic_broadcast;
}
```

When a `clock event` device issues an interrupt, the `dev->event_handler` will be called. For example, let's look at the interrupt handler of the [high precision event timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer), which is located in the [arch/x86/kernel/hpet.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/hpet.c) source code file:

```C
static irqreturn_t hpet_interrupt_handler(int irq, void *data)
{
    struct hpet_dev *dev = (struct hpet_dev *)data;
    struct clock_event_device *hevt = &dev->evt;

    if (!hevt->event_handler) {
        printk(KERN_INFO "Spurious HPET timer interrupt on HPET timer %d\n",
                dev->num);
        return IRQ_HANDLED;
    }

    hevt->event_handler(hevt);
    return IRQ_HANDLED;
}
```

The `hpet_interrupt_handler` gets the [irq](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) specific data and checks the event handler of the `clock event` device. Recall that we just set it in the `tick_set_periodic_handler` function. So the `tick_handle_periodic_broadcast` function will be called at the end of the high precision event timer interrupt handler.

The `tick_handle_periodic_broadcast` function calls the

```C
bc_local = tick_do_periodic_broadcast();
```

function, which stores the numbers of processors that have asked to be woken up in the temporary `cpumask` and calls the `tick_do_broadcast` function:

```C
cpumask_and(tmpmask, cpu_online_mask, tick_broadcast_mask);
return tick_do_broadcast(tmpmask);
```

The `tick_do_broadcast` function calls the `broadcast` function of the given clock events device, which sends an [IPI](https://en.wikipedia.org/wiki/Inter-processor_interrupt) interrupt to the set of processors. In the end we can call the event handler of the given `tick_device`:

```C
if (bc_local)
    td->evtdev->event_handler(td->evtdev);
```

which actually represents the interrupt handler of the local timer of a processor. After this, the processor will wake up. That is all about the `tick broadcast` framework in the Linux kernel. We have missed some aspects of this framework, for example the reprogramming of a `clock event` device and broadcast with the oneshot timer, etc. But the Linux kernel is very big and it is not realistic to cover all aspects of it. I think it will be interesting to dive into it by yourself.

If you remember, we started this part with the call of the `tick_init` function. We have just considered the `tick_broadcast_init` function and the related theory, but the `tick_init` function contains another function call, and this function is `tick_nohz_init`. Let's look at the implementation of this function.

Initialization of dyntick related data structures
--------------------------------------------------------------------------------

We already saw some information about the `dyntick` concept in this part and we know that this concept allows the kernel to disable system timer interrupts in the `idle` state. The `tick_nohz_init` function initializes the different data structures which are related to this concept. This function is defined in the [kernel/time/tick-sched.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-sched.c) source code file and starts by checking the value of the `tick_nohz_full_running` variable, which represents the state of the tick-less mode for the `idle` state as well as the state when system timer interrupts are disabled while a processor has only one runnable task:

```C
if (!tick_nohz_full_running) {
    if (tick_nohz_init_all() < 0)
        return;
}
```

If this mode is not running, we call the `tick_nohz_init_all` function that is defined in the same source code file and check its result. The `tick_nohz_init_all` function tries to allocate the `tick_nohz_full_mask` with the call of `alloc_cpumask_var`, which will allocate space for it. The `tick_nohz_full_mask` will store the numbers of processors that have full `NO_HZ` enabled. After successful allocation of the `tick_nohz_full_mask`, we set all bits in the `tick_nohz_full_mask`, set the `tick_nohz_full_running` variable and return the result to the `tick_nohz_init` function:

```C
static int tick_nohz_init_all(void)
{
    int err = -1;
#ifdef CONFIG_NO_HZ_FULL_ALL
    if (!alloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
        WARN(1, "NO_HZ: Can't allocate full dynticks cpumask\n");
        return err;
    }
    err = 0;
    cpumask_setall(tick_nohz_full_mask);
    tick_nohz_full_running = true;
#endif
    return err;
}
```

In the next step we try to allocate memory space for the `housekeeping_mask`:

```C
if (!alloc_cpumask_var(&housekeeping_mask, GFP_KERNEL)) {
    WARN(1, "NO_HZ: Can't allocate not-full dynticks cpumask\n");
    cpumask_clear(tick_nohz_full_mask);
    tick_nohz_full_running = false;
    return;
}
```

This `cpumask` will store the numbers of processors used for `housekeeping`; in other words, we need at least one processor that will not be in `NO_HZ` mode, because it will do timekeeping, etc. After this we check the result of the architecture-specific `arch_irq_work_has_interrupt` function. This function checks the ability to send inter-processor interrupts for the given architecture. We need to check this because the system timer of a processor will be disabled during `NO_HZ` mode, so there must be at least one online processor which can send an inter-processor interrupt to awake an offline processor. For [x86_64](https://en.wikipedia.org/wiki/X86-64), this function is defined in the [arch/x86/include/asm/irq_work.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irq_work.h) header file and just checks that a processor has an [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) based on [CPUID](https://en.wikipedia.org/wiki/CPUID):

```C
static inline bool arch_irq_work_has_interrupt(void)
{
    return cpu_has_apic;
}
```

If a processor does not have an `APIC`, the Linux kernel prints a warning message, clears the `tick_nohz_full_mask` cpumask, copies the numbers of all possible processors in the system to the `housekeeping_mask` and resets the value of the `tick_nohz_full_running` variable:

```C
if (!arch_irq_work_has_interrupt()) {
    pr_warning("NO_HZ: Can't run full dynticks because arch doesn't "
               "support irq work self-IPIs\n");
    cpumask_clear(tick_nohz_full_mask);
    cpumask_copy(housekeeping_mask, cpu_possible_mask);
    tick_nohz_full_running = false;
    return;
}
```

After this step, we get the number of the current processor by the call of `smp_processor_id` and check whether this processor is in the `tick_nohz_full_mask`. If the `tick_nohz_full_mask` contains the given processor, we clear the appropriate bit in the `tick_nohz_full_mask`:

```C
cpu = smp_processor_id();

if (cpumask_test_cpu(cpu, tick_nohz_full_mask)) {
    pr_warning("NO_HZ: Clearing %d from nohz_full range for timekeeping\n", cpu);
    cpumask_clear_cpu(cpu, tick_nohz_full_mask);
}
```

This is because this processor will be used for timekeeping. After this step we put into the `housekeeping_mask` all the processors that are in the `cpu_possible_mask` but not in the `tick_nohz_full_mask`:

```C
cpumask_andnot(housekeeping_mask,
               cpu_possible_mask, tick_nohz_full_mask);
```

After this operation, the `housekeeping_mask` will contain all the processors of the system except the processor used for timekeeping. In the last step of the `tick_nohz_init` function, we go through all the processors that are defined in the `tick_nohz_full_mask` and call the following function for each of them:

```C
for_each_cpu(cpu, tick_nohz_full_mask)
    context_tracking_cpu_set(cpu);
```

The `context_tracking_cpu_set` function is defined in the [kernel/context_tracking.c](https://github.com/torvalds/linux/blob/master/kernel/context_tracking.c) source code file and its main point is to set the `context_tracking.active` [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable to `true`. When the `active` field is set to `true` for a certain processor, the Linux kernel context tracking subsystem will start to track [context switches](https://en.wikipedia.org/wiki/Context_switch) on this processor instead of ignoring them.

That's all. This is the end of the `tick_nohz_init` function, and after it the `NO_HZ` related data structures are initialized. We didn't see the API of the `NO_HZ` mode, but we will see it soon.

Conclusion
--------------------------------------------------------------------------------

This is the end of the third part of the chapter that describes timers and timer management related stuff in the Linux kernel. In the previous part we got acquainted with the `clocksource` concept, which represents a framework for managing different clock sources in an interrupt- and hardware-independent way. In this part, we continued to look at the Linux kernel initialization process in the context of time management and got acquainted with two new concepts: the `tick broadcast` framework and the `tick-less` mode. The first concept helps the Linux kernel deal with processors which are in deep sleep, and the second represents a mode in which the kernel may work to improve the power management of `idle` processors.

In the next part we will continue to dive into timer management related things in the Linux kernel and will see a new concept for us - `timers`.

If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
-------------------------------------------------------------------------------

* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [initrd](https://en.wikipedia.org/wiki/Initrd)
* [interrupt](https://en.wikipedia.org/wiki/Interrupt)
* [DMI](https://en.wikipedia.org/wiki/Desktop_Management_Interface)
* [printk](https://en.wikipedia.org/wiki/Printk)
* [CPU idle](https://en.wikipedia.org/wiki/Idle_%28CPU%29)
* [power management](https://en.wikipedia.org/wiki/Power_management)
* [NO_HZ documentation](https://github.com/torvalds/linux/blob/master/Documentation/timers/NO_HZ.txt)
* [cpumasks](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
* [high precision event timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
* [irq](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [IPI](https://en.wikipedia.org/wiki/Inter-processor_interrupt)
* [CPUID](https://en.wikipedia.org/wiki/CPUID)
* [APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
* [percpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [context switches](https://en.wikipedia.org/wiki/Context_switch)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html)
Timers and time management in the Linux kernel. Part 4.
================================================================================

Timers
--------------------------------------------------------------------------------

This is the fourth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html) we learned about the `tick broadcast` framework and `NO_HZ` mode in the Linux kernel. In this part we will continue to dive into time management related stuff and will get acquainted with yet another concept in the Linux kernel - `timers`. Before we look at timers in the Linux kernel, we have to learn some theory about this concept. Note that we will consider software timers in this part.

The Linux kernel provides the `software timer` concept to allow kernel functions to be invoked at a future moment. Timers are widely used in the Linux kernel. For example, look in the [net/netfilter/ipset/ip_set_list_set.c](https://github.com/torvalds/linux/blob/master/net/netfilter/ipset/ip_set_list_set.c) source code file. This source code file implements the framework for managing groups of [IP](https://en.wikipedia.org/wiki/Internet_Protocol) addresses.

We can find the `list_set` structure that contains a `gc` field in this source code file:

```C
struct list_set {
    ...
    struct timer_list gc;
    ...
};
```

Note that the `gc` field has the `timer_list` type. This structure is defined in the [include/linux/timer.h](https://github.com/torvalds/linux/blob/master/include/linux/timer.h) header file and its main point is to store `dynamic` timers in the Linux kernel. Actually, the Linux kernel provides two types of timers, called dynamic timers and interval timers. The first type of timers is used by the kernel, and the second can be used by user mode. The `timer_list` structure contains an actual `dynamic` timer. The `gc` timer contained in the `list_set` in our example represents the timer for garbage collection. This timer will be initialized in the `list_set_gc_init` function:

```C
static void
list_set_gc_init(struct ip_set *set, void (*gc)(unsigned long ul_set))
{
    struct list_set *map = set->data;
    ...
    ...
    ...
    map->gc.function = gc;
    map->gc.expires = jiffies + IPSET_GC_PERIOD(set->timeout) * HZ;
    ...
    ...
    ...
}
```

The function pointed to by the `gc` pointer will be called after the timeout, which is equal to `map->gc.expires`. Since `jiffies` is incremented `HZ` times per second, adding `N * HZ` to the current `jiffies` value schedules the callback roughly `N` seconds in the future.

Ok, we will not dive into this example with [netfilter](https://en.wikipedia.org/wiki/Netfilter), because this chapter is not about [network](https://en.wikipedia.org/wiki/Computer_network) related stuff. But we saw that timers are widely used in the Linux kernel and learned that they represent a concept which allows functions to be called in the future.

Now let's continue to research the source code of the Linux kernel which is related to timers and time management stuff, as we did in all previous chapters.

Introduction to dynamic timers in the Linux kernel
--------------------------------------------------------------------------------

As I already wrote, we learned about the `tick broadcast` framework and `NO_HZ` mode in the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html). They are initialized in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file by the call of the `tick_init` function. If we look at this source code file, we will see that the next time management related function is:

```C
init_timers();
```

This function is defined in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file and contains calls of four functions:

```C
void __init init_timers(void)
{
    init_timer_cpus();
    init_timer_stats();
    timer_register_cpu_notifier();
    open_softirq(TIMER_SOFTIRQ, run_timer_softirq);
}
```

Let's look at the implementation of each function. The first function is `init_timer_cpus`, defined in the [same](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file, which just calls the `init_timer_cpu` function for each possible processor in the system:

```C
static void __init init_timer_cpus(void)
{
    int cpu;

    for_each_possible_cpu(cpu)
        init_timer_cpu(cpu);
}
```

If you do not know or do not remember what a `possible` cpu is, you can read the special [part](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) of this book which describes the `cpumask` concept in the Linux kernel. In short, a `possible` processor is a processor which can be plugged in at any time during the life of the system.

The `init_timer_cpu` function does the main work for us, namely it initializes the `tvec_base` structure for each processor. This structure is defined in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file and stores data related to `dynamic` timers for a certain processor. Let's look at the definition of this structure:

```C
struct tvec_base {
    spinlock_t lock;
    struct timer_list *running_timer;
    unsigned long timer_jiffies;
    unsigned long next_timer;
    unsigned long active_timers;
    unsigned long all_timers;
    int cpu;
    bool migration_enabled;
    bool nohz_active;
    struct tvec_root tv1;
    struct tvec tv2;
    struct tvec tv3;
    struct tvec tv4;
    struct tvec tv5;
} ____cacheline_aligned;
```

The `tvec_base` structure contains the following fields. The `lock` field protects the `tvec_base` structure. The `running_timer` field points to the currently running timer for the given processor. The `timer_jiffies` field represents the earliest expiration time (the Linux kernel uses it to find already expired timers). The `next_timer` field contains the next pending timer for the next timer [interrupt](https://en.wikipedia.org/wiki/Interrupt) in the case when a processor goes to sleep and the `NO_HZ` mode is enabled in the Linux kernel. The `active_timers` field accounts for the non-deferrable timers, in other words all timers that will not be stopped while a processor goes to sleep. The `all_timers` field tracks the total number of timers, that is `active_timers` plus deferrable timers. The `cpu` field holds the number of the processor which owns these timers. The `migration_enabled` and `nohz_active` fields represent the possibility of timer migration to another processor and the status of the `NO_HZ` mode respectively.

The last five fields of the `tvec_base` structure represent lists of dynamic timers. The first, `tv1`, has the:

```C
#define TVR_BITS (CONFIG_BASE_SMALL ? 6 : 8)
#define TVR_SIZE (1 << TVR_BITS)

...
...
...

struct tvec_root {
    struct hlist_head vec[TVR_SIZE];
};
```

type. Note that the value of `TVR_SIZE` depends on the `CONFIG_BASE_SMALL` kernel configuration option:

![base small](http://s17.postimg.org/db3towlu7/base_small.png)

which reduces the size of the kernel data structures if enabled. The `tv1` is an array that may contain `64` or `256` elements, where each element represents a dynamic timer that will decay within the next `255` system timer interrupts. The next three fields, `tv2`, `tv3` and `tv4`, are lists with dynamic timers too, but they store dynamic timers which will decay within the next `2^14 - 1`, `2^20 - 1` and `2^26 - 1` system timer interrupts respectively. The last field, `tv5`, represents a list which stores dynamic timers with a large expiration period.

So, now that we have seen the `tvec_base` structure and the description of its fields, we can look at the implementation of the `init_timer_cpu` function. As I already wrote, this function is defined in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file and initializes the `tvec_bases`:

```C
static void __init init_timer_cpu(int cpu)
{
    struct tvec_base *base = per_cpu_ptr(&tvec_bases, cpu);

    base->cpu = cpu;
    spin_lock_init(&base->lock);

    base->timer_jiffies = jiffies;
    base->next_timer = base->timer_jiffies;
}
```

The `tvec_bases` is a [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable which represents the main data structure for dynamic timers for a given processor. This `per-cpu` variable is defined in the same source code file:

```C
static DEFINE_PER_CPU(struct tvec_base, tvec_bases);
```

First of all we get the address of the `tvec_bases` for the given processor into the `base` variable and, once we have got it, we start to initialize some of the `tvec_base` fields in the `init_timer_cpu` function. After the initialization of the `per-cpu` dynamic timers with the [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) value and the number of the possible processor, we need to initialize the `tstats_lookup_lock` [spinlock](https://en.wikipedia.org/wiki/Spinlock) in the `init_timer_stats` function:

```C
void __init init_timer_stats(void)
{
    int cpu;

    for_each_possible_cpu(cpu)
        raw_spin_lock_init(&per_cpu(tstats_lookup_lock, cpu));
}
```

The `tstats_lookup_lock` variable is a `per-cpu` raw spinlock:

```C
static DEFINE_PER_CPU(raw_spinlock_t, tstats_lookup_lock);
```

which will be used to protect operations on the timer statistics that can be accessed through [procfs](https://en.wikipedia.org/wiki/Procfs):

```C
static int __init init_tstats_procfs(void)
{
    struct proc_dir_entry *pe;

    pe = proc_create("timer_stats", 0644, NULL, &tstats_fops);
    if (!pe)
        return -ENOMEM;
    return 0;
}
```

For example:

```
$ cat /proc/timer_stats
Timerstats sample period: 3.888770 s
  12,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
  15,     1 swapper          hcd_submit_urb (rh_timer_func)
   4,   959 kedac            schedule_timeout (process_timeout)
   1,     0 swapper          page_writeback_init (wb_timer_fn)
  28,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
  22,  2948 IRQ 4            tty_flip_buffer_push (delayed_work_timer_fn)
...
...
...
```
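
Note that the collection of these statistics has to be switched on first; the documented interface is to write `1` to the same file to start collection and `0` to stop it:

```
$ echo 1 > /proc/timer_stats   # start collection
$ echo 0 > /proc/timer_stats   # stop collection
```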

The next step after the initialization of the `tstats_lookup_lock` spinlock is the call of the `timer_register_cpu_notifier` function. This function depends on the `CONFIG_HOTPLUG_CPU` kernel configuration option which enables support for [hotplug](https://en.wikipedia.org/wiki/Hot_swapping) processors in the Linux kernel.

When a processor is logically offlined, a notification with the `CPU_DEAD` or the `CPU_DEAD_FROZEN` event is sent to the Linux kernel by the call of the `cpu_notifier` macro:

```C
#ifdef CONFIG_HOTPLUG_CPU
...
...
static inline void timer_register_cpu_notifier(void)
{
    cpu_notifier(timer_cpu_notify, 0);
}
...
...
#else
...
...
static inline void timer_register_cpu_notifier(void) { }
...
...
#endif /* CONFIG_HOTPLUG_CPU */
```

In this case the `timer_cpu_notify` function will be called; it checks the event type and calls the `migrate_timers` function:

```C
static int timer_cpu_notify(struct notifier_block *self,
                            unsigned long action, void *hcpu)
{
    switch (action) {
    case CPU_DEAD:
    case CPU_DEAD_FROZEN:
        migrate_timers((long)hcpu);
        break;
    default:
        break;
    }

    return NOTIFY_OK;
}
```

This chapter will not describe `hotplug` related events in the Linux kernel source code, but if you are interested in such things, you can find the implementation of the `migrate_timers` function in the [kernel/time/timer.c](https://github.com/torvalds/linux/blob/master/kernel/time/timer.c) source code file.

The last step in the `init_timers` function is the call of the:

```C
open_softirq(TIMER_SOFTIRQ, run_timer_softirq);
```

function. The `open_softirq` function may be already familiar to you if you have read the ninth [part](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html) about interrupts and interrupt handling in the Linux kernel. In short, the `open_softirq` function is defined in the [kernel/softirq.c](https://github.com/torvalds/linux/blob/master/kernel/softirq.c) source code file and registers a deferred interrupt handler.

In our case the deferred function is the `run_timer_softirq` function, which will be called after a hardware interrupt in the `do_IRQ` function that is defined in the [arch/x86/kernel/irq.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/irq.c) source code file. The main point of this function is to handle software dynamic timers. The Linux kernel does not do this during the hardware timer interrupt handling because it is a time consuming operation.

Let's look at the implementation of the `run_timer_softirq` function:

```C
static void run_timer_softirq(struct softirq_action *h)
{
    struct tvec_base *base = this_cpu_ptr(&tvec_bases);

    if (time_after_eq(jiffies, base->timer_jiffies))
        __run_timers(base);
}
```

At the beginning of the `run_timer_softirq` function we get the `dynamic` timer base for the current processor and compare the current value of [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) with the value of `timer_jiffies` of this structure by calling the `time_after_eq` macro, which is defined in the [include/linux/jiffies.h](https://github.com/torvalds/linux/blob/master/include/linux/jiffies.h) header file:

```C
#define time_after_eq(a,b)          \
    (typecheck(unsigned long, a) && \
     typecheck(unsigned long, b) && \
     ((long)((a) - (b)) >= 0))
```
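
As a quick illustration (a hypothetical snippet, not kernel code), this macro is typically used to test whether a deadline stored in `jiffies` units has passed; the signed subtraction makes the comparison safe even when `jiffies` wraps around:

```C
unsigned long timeout = jiffies + 2 * HZ;  /* two seconds from now */

/* ... later ... */
if (time_after_eq(jiffies, timeout))
    pr_info("the timeout has expired\n");
```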

Recall that the `timer_jiffies` field of the `tvec_base` structure represents the relative time when the functions delayed by the given timer will be executed. So we compare these two values and, if the current time represented by `jiffies` is greater than or equal to `base->timer_jiffies`, we call the `__run_timers` function that is defined in the same source code file. Let's look at the implementation of this function.

As I just wrote, the `__run_timers` function runs all expired timers for a given processor. This function starts by acquiring the `tvec_base`'s lock to protect the `tvec_base` structure:

```C
static inline void __run_timers(struct tvec_base *base)
{
    struct timer_list *timer;

    spin_lock_irq(&base->lock);
    ...
    ...
    ...
    spin_unlock_irq(&base->lock);
}
```

After this, it loops while `timer_jiffies` is not greater than `jiffies`:

```C
while (time_after_eq(jiffies, base->timer_jiffies)) {
    ...
    ...
    ...
}
```

We can find many different manipulations in this loop, but the main point is to find the expired timers and call the delayed functions. First of all we need to calculate the `index` of the `base->tv1` list that stores the next timer to be handled, with the following expression:

```C
index = base->timer_jiffies & TVR_MASK;
```
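
For reference, `TVR_MASK` is derived from `TVR_SIZE` in the same source code file, so the expression above just takes the low `TVR_BITS` bits of `timer_jiffies`:

```C
#define TVR_MASK (TVR_SIZE - 1)
```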

where `TVR_MASK` is a mask for getting the `tvec_root->vec` elements. Once we have got the index of the next timer which must be handled, we check its value. If the index is zero, we go through all the lists in our cascade table (`tv2`, `tv3`, etc.) and rehash them with the call of the `cascade` function:

```C
if (!index &&
    (!cascade(base, &base->tv2, INDEX(0))) &&
        (!cascade(base, &base->tv3, INDEX(1))) &&
            !cascade(base, &base->tv4, INDEX(2)))
    cascade(base, &base->tv5, INDEX(3));
```

After this we increase the value of `base->timer_jiffies`:

```C
++base->timer_jiffies;
```

In the last step we execute the corresponding function for each timer from the list in the following loop:

```C
hlist_move_list(base->tv1.vec + index, head);

while (!hlist_empty(head)) {
    ...
    ...
    ...
    timer = hlist_entry(head->first, struct timer_list, entry);
    fn = timer->function;
    data = timer->data;

    spin_unlock(&base->lock);
    call_timer_fn(timer, fn, data);
    spin_lock(&base->lock);

    ...
    ...
    ...
}
```

where `call_timer_fn` just calls the given function:

```C
static void call_timer_fn(struct timer_list *timer, void (*fn)(unsigned long),
                          unsigned long data)
{
    ...
    ...
    ...
    fn(data);
    ...
    ...
    ...
}
```

That's all. From this moment, the Linux kernel has the infrastructure for `dynamic timers`. We will not dive deeper into this interesting theme. As I already wrote, `timers` are a [widely](http://lxr.free-electrons.com/ident?i=timer_list) used concept in the Linux kernel, and neither one part nor two would be enough to cover how they are implemented and how they work. But now we know about this concept, why the Linux kernel needs it, and some data structures around it.

Now let's look at the usage of `dynamic timers` in the Linux kernel.

Usage of dynamic timers
--------------------------------------------------------------------------------

As you may have already noted, if the Linux kernel provides a concept, it also provides an API for managing this concept, and the `dynamic timers` concept is no exception here. To use a timer in Linux kernel code, we must define a variable of the `timer_list` type. We can initialize our `timer_list` structure in two ways. The first is to use the `init_timer` macro that is defined in the [include/linux/timer.h](https://github.com/torvalds/linux/blob/master/include/linux/timer.h) header file:

```C
#define init_timer(timer)    \
    __init_timer((timer), 0)

#define __init_timer(_timer, _flags)   \
    init_timer_key((_timer), (_flags), NULL, NULL)
```

where the `init_timer_key` function just calls the:

```C
do_init_timer(timer, flags, name, key);
```

function, which fills the given `timer` with default values. The second way is to use the:

```C
#define TIMER_INITIALIZER(_function, _expires, _data)       \
    __TIMER_INITIALIZER((_function), (_expires), (_data), 0)
```

macro, which will initialize the given `timer_list` structure too.
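
For example, a statically initialized timer could be declared like this (a hypothetical sketch; `example_fn` is an illustrative name, not a kernel function):

```C
/* Hypothetical example: statically define and initialize a timer. */
static void example_fn(unsigned long data);

static struct timer_list example_timer =
    TIMER_INITIALIZER(example_fn, 0, 0);
```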

After a `dynamic timer` is initialized, we can start it with the call of the:

```C
void add_timer(struct timer_list *timer);
```

function and stop it with the:

```C
int del_timer(struct timer_list *timer);
```

function.
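
Putting the pieces together, a minimal sketch of this API might look as follows (a hypothetical example: the callback name, the five second timeout and the wrapper functions are all illustrative, and the `function`/`data`/`expires` fields correspond to the `timer_list` layout used in this part of the book):

```C
#include <linux/timer.h>
#include <linux/jiffies.h>

static struct timer_list my_timer;  /* hypothetical example timer */

/* Callback invoked by the timer softirq after the timer expires. */
static void my_timer_callback(unsigned long data)
{
    pr_info("timer expired, data: %lu\n", data);
}

static void my_timer_start(void)
{
    init_timer(&my_timer);
    my_timer.function = my_timer_callback;
    my_timer.data = 0;
    my_timer.expires = jiffies + 5 * HZ;  /* fire in about five seconds */
    add_timer(&my_timer);
}

static void my_timer_stop(void)
{
    del_timer(&my_timer);  /* deactivate the timer if it is still pending */
}
```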

That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the fourth part of the chapter that describes timers and timer management related stuff in the Linux kernel. In the previous part we got acquainted with two new concepts: the `tick broadcast` framework and the `NO_HZ` mode. In this part we continued to dive into time management related stuff and got acquainted with a new concept - the `dynamic timer`, or software timer. We did not see the implementation of the `dynamic timers` management code in detail in this part, but we saw the data structures and the API around this concept.

In the next part we will continue to dive into timer management related things in the Linux kernel and will see a concept that is new for us - `timers`.

If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
-------------------------------------------------------------------------------

* [IP](https://en.wikipedia.org/wiki/Internet_Protocol)
* [netfilter](https://en.wikipedia.org/wiki/Netfilter)
* [network](https://en.wikipedia.org/wiki/Computer_network)
* [cpumask](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
* [interrupt](https://en.wikipedia.org/wiki/Interrupt)
* [jiffies](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html)
* [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [spinlock](https://en.wikipedia.org/wiki/Spinlock)
* [procfs](https://en.wikipedia.org/wiki/Procfs)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html)

Timers and time management in the Linux kernel. Part 5.
================================================================================

Introduction to the `clockevents` framework
--------------------------------------------------------------------------------

This is the fifth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. As you might have noted from the title of this part, the `clockevents` framework will be discussed. We already saw one framework in the [second](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) part of this chapter - the `clocksource` framework. Both of these frameworks represent timekeeping abstractions in the Linux kernel.

At first let's refresh your memory and try to remember what the `clocksource` framework is and what its purpose is. The main goal of the `clocksource` framework is to provide a `timeline`. As described in the [documentation](https://github.com/0xAX/linux/blob/master/Documentation/timers/timekeeping.txt):

> For example issuing the command 'date' on a Linux system will eventually read the clock source to determine exactly what time it is.

The Linux kernel supports many different clock sources. You can find some of them in [drivers/clocksource](https://github.com/torvalds/linux/tree/master/drivers/clocksource). For example the good old [Intel 8253](https://en.wikipedia.org/wiki/Intel_8253) - a [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer) with a `1193182` Hz frequency, or yet another one - the [ACPI PM](http://uefi.org/sites/default/files/resources/ACPI_5.pdf) timer with a `3579545` Hz frequency. Besides the [drivers/clocksource](https://github.com/torvalds/linux/tree/master/drivers/clocksource) directory, each architecture may provide its own architecture-specific clock sources. For example the [x86](https://en.wikipedia.org/wiki/X86) architecture provides the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer), and [powerpc](https://en.wikipedia.org/wiki/PowerPC) provides access to the processor timer through the `timebase` register.

Each clock source provides a monotonic atomic counter. As I already wrote, the Linux kernel supports a huge set of different clock sources and each clock source has its own parameters like [frequency](https://en.wikipedia.org/wiki/Frequency). The main goal of the `clocksource` framework is to provide an [API](https://en.wikipedia.org/wiki/Application_programming_interface) to select the best available clock source in the system, i.e. the clock source with the highest frequency. An additional goal of the `clocksource` framework is to represent the atomic counter provided by a clock source in human units. At this time, nanoseconds are the favorite choice for the time value units of a given clock source in the Linux kernel.

The `clocksource` framework is represented by the `clocksource` structure which is defined in the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h) header file and contains the `name` of a clock source, the rating of the clock source in the system (a clock source with a higher frequency has a higher rating in the system), a `list` of all registered clock sources in the system, `enable` and `disable` fields to enable and disable a clock source, a pointer to the `read` function which must return the atomic counter of a clock source, and so on.

Additionally the `clocksource` structure provides two fields: `mult` and `shift`, which are needed for the translation of the atomic counter provided by a certain clock source to human units, i.e. [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond). Translation occurs via the following formula:

```
ns ~= (clocksource * mult) >> shift
```
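
This computation is implemented by the `clocksource_cyc2ns` helper from the same [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h) header file:

```C
static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift)
{
	return ((u64) cycles * mult) >> shift;
}
```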

As we already know, besides the `clocksource` structure, the `clocksource` framework provides an API for the registration of clock sources with different frequency scale factors:

```C
static inline int clocksource_register_hz(struct clocksource *cs, u32 hz)
static inline int clocksource_register_khz(struct clocksource *cs, u32 khz)
```

for clock source unregistration:

```C
int clocksource_unregister(struct clocksource *cs)
```

and so on.
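
To make this registration API more concrete, here is a minimal sketch (not a real driver) of how a clock source might be defined and registered. All of the `my_*` names and the `read_hardware_counter` stub are hypothetical and used only for illustration; the structure fields and the `clocksource_register_hz` call are the ones described above:

```C
#include <linux/clocksource.h>

/* hypothetical stub: a real driver would read its hardware counter here */
static u32 read_hardware_counter(void)
{
	return 0;
}

static cycle_t my_cs_read(struct clocksource *cs)
{
	return (cycle_t)read_hardware_counter();
}

static struct clocksource my_clocksource = {
	.name   = "my_counter",
	.rating = 100,
	.read   = my_cs_read,
	.mask   = CLOCKSOURCE_MASK(32),
	.flags  = CLOCK_SOURCE_IS_CONTINUOUS,
};

static int __init my_clocksource_init(void)
{
	/* mult and shift are calculated from the given frequency (1 MHz here) */
	return clocksource_register_hz(&my_clocksource, 1000000);
}
```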

Additionally to the `clocksource` framework, the Linux kernel provides the `clockevents` framework. As described in the [documentation](https://github.com/0xAX/linux/blob/master/Documentation/timers/timekeeping.txt):

> Clock events are the conceptual reverse of clock sources

The main goal of this framework is to manage clock event devices, or in other words - to manage devices that allow us to register an event, that is an [interrupt](https://en.wikipedia.org/wiki/Interrupt), that is going to happen at a defined point of time in the future.

Now that we know a little about the `clockevents` framework in the Linux kernel, it is time to look at its [API](https://en.wikipedia.org/wiki/Application_programming_interface).

API of the `clockevents` framework
-------------------------------------------------------------------------------

The main structure which describes a clock event device is the `clock_event_device` structure. This structure is defined in the [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h) header file and contains a huge set of fields. Just like the `clocksource` structure, it has a `name` field which contains the human readable name of a clock event device, for example the [local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) timer:

```C
static struct clock_event_device lapic_clockevent = {
	.name = "lapic",
	...
	...
	...
}
```

It also stores the `event_handler`, `set_next_event` and `next_event` fields for a certain clock event device, which are the [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler), the setter of the next event and the local storage for the next event, respectively. Yet another field of the `clock_event_device` structure is the `features` field. Its value may be one of the following generic features:

```C
#define CLOCK_EVT_FEAT_PERIODIC	0x000001
#define CLOCK_EVT_FEAT_ONESHOT	0x000002
```

where `CLOCK_EVT_FEAT_PERIODIC` represents a device which may be programmed to generate events periodically and `CLOCK_EVT_FEAT_ONESHOT` represents a device which may generate an event only once. Besides these two features, there are also architecture-specific features. For example [x86_64](https://en.wikipedia.org/wiki/X86-64) supports additional features like:

```C
#define CLOCK_EVT_FEAT_C3STOP		0x000008
```

The `CLOCK_EVT_FEAT_C3STOP` flag means that a clock event device will be stopped in the [C3](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface#Device_states) state. Additionally the `clock_event_device` structure has `mult` and `shift` fields, just like the `clocksource` structure. The `clock_event_device` structure also contains other fields, but we will consider them later.

Now that we have considered part of the `clock_event_device` structure, it is time to look at the `API` of the `clockevents` framework. To work with a clock event device, first of all we need to initialize a `clock_event_device` structure and register the clock event device. The `clockevents` framework provides the following `API` for the registration of clock event devices:

```C
void clockevents_register_device(struct clock_event_device *dev)
{
	...
	...
	...
}
```

This function is defined in the [kernel/time/clockevents.c](https://github.com/torvalds/linux/blob/master/kernel/time/clockevents.c) source code file and, as we may see, the `clockevents_register_device` function takes only one parameter:

* the address of a `clock_event_device` structure which represents a clock event device.

So, to register a clock event device, at first we need to initialize a `clock_event_device` structure with the parameters of a certain clock event device. Let's take a look at one random clock event device in the Linux kernel source code. We can find one in the [drivers/clocksource](https://github.com/torvalds/linux/tree/master/drivers/clocksource) directory or take a look at an architecture-specific clock event device. Let's take for example the [Periodic Interval Timer (PIT) for at91sam926x](http://www.atmel.com/Images/doc6062.pdf). You can find its implementation in [drivers/clocksource/timer-atmel-pit.c](https://github.com/torvalds/linux/tree/master/drivers/clocksource/timer-atmel-pit.c).

First of all let's look at the initialization of the `clock_event_device` structure. This occurs in the `at91sam926x_pit_common_init` function:

```C
struct pit_data {
	...
	...
	struct clock_event_device	clkevt;
	...
	...
};

static void __init at91sam926x_pit_common_init(struct pit_data *data)
{
	...
	...
	...
	data->clkevt.name = "pit";
	data->clkevt.features = CLOCK_EVT_FEAT_PERIODIC;
	data->clkevt.shift = 32;
	data->clkevt.mult = div_sc(pit_rate, NSEC_PER_SEC, data->clkevt.shift);
	data->clkevt.rating = 100;
	data->clkevt.cpumask = cpumask_of(0);

	data->clkevt.set_state_shutdown = pit_clkevt_shutdown;
	data->clkevt.set_state_periodic = pit_clkevt_set_periodic;
	data->clkevt.resume = at91sam926x_pit_resume;
	data->clkevt.suspend = at91sam926x_pit_suspend;
	...
}
```

Here we can see that `at91sam926x_pit_common_init` takes one parameter - a pointer to the `pit_data` structure, which contains a `clock_event_device` structure that will hold the clock event related information of the `at91sam926x` [Periodic Interval Timer](https://en.wikipedia.org/wiki/Programmable_interval_timer). At the start we fill in the `name` of the timer device and its `features`. In our case we deal with a periodic timer which, as we already know, may be programmed to generate events periodically.

The next two fields, `shift` and `mult`, are familiar to us. They will be used to translate the counter of our timer to nanoseconds. After this we set the rating of the timer to `100`. This means that if there are no timers with a higher rating in the system, this timer will be used for timekeeping. The next field - `cpumask` - indicates for which processors in the system the device will work. In our case, the device will work for the first processor. The `cpumask_of` macro is defined in the [include/linux/cpumask.h](https://github.com/torvalds/linux/tree/master/include/linux/cpumask.h) header file and just expands to the call of:

```C
#define cpumask_of(cpu) (get_cpu_mask(cpu))
```

where `get_cpu_mask` returns the cpumask containing just the given `cpu` number. You can read more about the `cpumasks` concept in the [CPU masks in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) part. In the last four lines of code we set the callbacks for clock event device suspend/resume, device shutdown and the update of the clock event device state.

After we have finished with the initialization of the `at91sam926x` periodic timer, we can register it by the call of the following function:

```C
clockevents_register_device(&data->clkevt);
```

Now we can consider the implementation of the `clockevents_register_device` function. As I already wrote above, this function is defined in the [kernel/time/clockevents.c](https://github.com/torvalds/linux/blob/master/kernel/time/clockevents.c) source code file and starts with the initialization of the initial event device state:

```C
clockevent_set_state(dev, CLOCK_EVT_STATE_DETACHED);
```

Actually, an event device may be in one of the following states:

```C
enum clock_event_state {
	CLOCK_EVT_STATE_DETACHED,
	CLOCK_EVT_STATE_SHUTDOWN,
	CLOCK_EVT_STATE_PERIODIC,
	CLOCK_EVT_STATE_ONESHOT,
	CLOCK_EVT_STATE_ONESHOT_STOPPED,
};
```

Where:

* `CLOCK_EVT_STATE_DETACHED` - the clock event device is not used by the `clockevents` framework. Actually it is the initial state of all clock event devices;
* `CLOCK_EVT_STATE_SHUTDOWN` - the clock event device is powered-off;
* `CLOCK_EVT_STATE_PERIODIC` - the clock event device may be programmed to generate events periodically;
* `CLOCK_EVT_STATE_ONESHOT` - the clock event device may be programmed to generate an event only once;
* `CLOCK_EVT_STATE_ONESHOT_STOPPED` - the clock event device was programmed to generate an event only once and is now temporarily stopped.

The implementation of the `clockevent_set_state` function is pretty easy:

```C
static inline void clockevent_set_state(struct clock_event_device *dev,
					enum clock_event_state state)
{
	dev->state_use_accessors = state;
}
```

As we can see, it just fills the `state_use_accessors` field of the given `clock_event_device` structure with the given value which, in our case, is `CLOCK_EVT_STATE_DETACHED`. Actually all clock event devices have this initial state during registration. The `state_use_accessors` field of the `clock_event_device` structure holds the `current` state of the clock event device.
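
The counterpart getter, which we will meet again below in the `clockevents_switch_state` code path, just reads this field back and is defined next to `clockevent_set_state`:

```C
static inline enum clock_event_state clockevent_get_state(struct clock_event_device *dev)
{
	return dev->state_use_accessors;
}
```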

After we have set the initial state of the given `clock_event_device` structure, we check that the `cpumask` of the given clock event device is not zero:

```C
if (!dev->cpumask) {
	WARN_ON(num_possible_cpus() > 1);
	dev->cpumask = cpumask_of(smp_processor_id());
}
```

Remember that we set the `cpumask` of the `at91sam926x` periodic timer to the first processor. If the `cpumask` field is zero, we print a warning message if there is more than one possible processor in the system and set the `cpumask` of the given clock event device to the current processor. If you are interested in how the `smp_processor_id` macro is implemented, you can read more about it in the fourth [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process chapter.

After this check we protect the actual code of the clock event device registration with the following macros:

```C
raw_spin_lock_irqsave(&clockevents_lock, flags);
...
...
...
raw_spin_unlock_irqrestore(&clockevents_lock, flags);
```

Besides taking the `clockevents_lock` spinlock, the `raw_spin_lock_irqsave` and `raw_spin_unlock_irqrestore` macros disable and re-enable local interrupts, although interrupts on other processors may still occur. We need to do this to prevent a potential [deadlock](https://en.wikipedia.org/wiki/Deadlock) in case an interrupt from another clock event device occurs while we are adding a new clock event device to the list of clock event devices.

Between the `raw_spin_lock_irqsave` and `raw_spin_unlock_irqrestore` macros we can see the following clock event device registration code:

```C
list_add(&dev->list, &clockevent_devices);
tick_check_new_device(dev);
clockevents_notify_released();
```

First of all we add the given clock event device to the list of clock event devices which is represented by `clockevent_devices`:

```C
static LIST_HEAD(clockevent_devices);
```

At the next step we call the `tick_check_new_device` function, which is defined in the [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c) source code file and checks whether the newly registered clock event device should be used. The `tick_check_new_device` function gets the currently registered tick device, which is represented by the `tick_device` structure, and compares its rating and features with those of the given `clock_event_device`. Actually, devices with the `CLOCK_EVT_FEAT_ONESHOT` feature are preferred:

```C
static bool tick_check_preferred(struct clock_event_device *curdev,
				 struct clock_event_device *newdev)
{
	if (!(newdev->features & CLOCK_EVT_FEAT_ONESHOT)) {
		if (curdev && (curdev->features & CLOCK_EVT_FEAT_ONESHOT))
			return false;
		if (tick_oneshot_mode_active())
			return false;
	}

	return !curdev ||
		newdev->rating > curdev->rating ||
		!cpumask_equal(curdev->cpumask, newdev->cpumask);
}
```

If the newly registered clock event device is preferred over the old tick device, we exchange the old and the newly registered devices and install the new device:

```C
clockevents_exchange_device(curdev, newdev);
tick_setup_device(td, newdev, cpu, cpumask_of(cpu));
```

The `clockevents_exchange_device` function releases, or in other words removes, the old clock event device from the `clockevent_devices` list. The next function - `tick_setup_device` - as we may understand from its name, sets up the new tick device. This function checks the mode of the tick device and calls either the `tick_setup_periodic` function or the `tick_setup_oneshot` function depending on that mode:

```C
if (td->mode == TICKDEV_MODE_PERIODIC)
	tick_setup_periodic(newdev, 0);
else
	tick_setup_oneshot(newdev, handler, next_event);
```

Both of these functions call `clockevents_switch_state` to change the state of the clock event device and `clockevents_program_event` to program the next event of the clock event device, based on the current time and the time of the next event. The `tick_setup_periodic`:

```C
clockevents_switch_state(dev, CLOCK_EVT_STATE_PERIODIC);
clockevents_program_event(dev, next, false);
```

and the `tick_setup_oneshot`:

```C
clockevents_switch_state(newdev, CLOCK_EVT_STATE_ONESHOT);
clockevents_program_event(newdev, next_event, true);
```

The `clockevents_switch_state` function checks that the clock event device is not already in the given state and calls the `__clockevents_switch_state` function from the same source code file:

```C
if (clockevent_get_state(dev) != state) {
	if (__clockevents_switch_state(dev, state))
		return;
```

The `__clockevents_switch_state` function just makes a call of a certain callback depending on the given state:

```C
static int __clockevents_switch_state(struct clock_event_device *dev,
				      enum clock_event_state state)
{
	if (dev->features & CLOCK_EVT_FEAT_DUMMY)
		return 0;

	switch (state) {
	case CLOCK_EVT_STATE_DETACHED:
	case CLOCK_EVT_STATE_SHUTDOWN:
		if (dev->set_state_shutdown)
			return dev->set_state_shutdown(dev);
		return 0;

	case CLOCK_EVT_STATE_PERIODIC:
		if (!(dev->features & CLOCK_EVT_FEAT_PERIODIC))
			return -ENOSYS;
		if (dev->set_state_periodic)
			return dev->set_state_periodic(dev);
		return 0;
	...
	...
	...
```

In our case, for the `at91sam926x` periodic timer, the state is `CLOCK_EVT_STATE_PERIODIC` and the registered callback is `pit_clkevt_set_periodic`:

```C
data->clkevt.features = CLOCK_EVT_FEAT_PERIODIC;
data->clkevt.set_state_periodic = pit_clkevt_set_periodic;
```

So, the `pit_clkevt_set_periodic` callback will be called. If we read the documentation of the [Periodic Interval Timer (PIT) for at91sam926x](http://www.atmel.com/Images/doc6062.pdf), we will see that there is a `Periodic Interval Timer Mode Register` which allows us to control the periodic interval timer.

It looks like this:

```
31                                                   25        24
+---------------------------------------------------------------+
|                                           | PITIEN  |  PITEN  |
+---------------------------------------------------------------+
23                         19                                  16
+---------------------------------------------------------------+
|                          |              PIV                   |
+---------------------------------------------------------------+
15                                                              8
+---------------------------------------------------------------+
|                             PIV                               |
+---------------------------------------------------------------+
7                                                               0
+---------------------------------------------------------------+
|                             PIV                               |
+---------------------------------------------------------------+
```

Here `PIV` or `Periodic Interval Value` defines the value compared with the primary `20-bit` counter of the Periodic Interval Timer, `PITEN` or `Periodic Interval Timer Enabled` enables the timer when the bit is `1`, and `PITIEN` or `Periodic Interval Timer Interrupt Enable` enables the interrupt when the bit is `1`. So, to set periodic mode, we need to set bits `24` and `25` in the `Periodic Interval Timer Mode Register`. And we are doing exactly that in the `pit_clkevt_set_periodic` function:

```C
static int pit_clkevt_set_periodic(struct clock_event_device *dev)
{
	struct pit_data *data = clkevt_to_pit_data(dev);
	...
	...
	...
	pit_write(data->base, AT91_PIT_MR,
		  (data->cycle - 1) | AT91_PIT_PITEN | AT91_PIT_PITIEN);

	return 0;
}
```

where `AT91_PIT_MR`, `AT91_PIT_PITEN` and `AT91_PIT_PITIEN` are declared as:

```C
#define AT91_PIT_MR		0x00
#define AT91_PIT_PITIEN		BIT(25)
#define AT91_PIT_PITEN		BIT(24)
```

After the setup of the new clock event device is finished, we can return to the `clockevents_register_device` function. The last function called in the `clockevents_register_device` function is:

```C
clockevents_notify_released();
```

This function checks the `clockevents_released` list which contains released clock event devices (remember that they may appear after the call of the `clockevents_exchange_device` function). If this list is not empty, we go through the clock event devices from the `clockevents_released` list, delete each one from that list and re-add it to `clockevent_devices`:

```C
static void clockevents_notify_released(void)
{
	struct clock_event_device *dev;

	while (!list_empty(&clockevents_released)) {
		dev = list_entry(clockevents_released.next,
				 struct clock_event_device, list);
		list_del(&dev->list);
		list_add(&dev->list, &clockevent_devices);
		tick_check_new_device(dev);
	}
}
```

That's all. From this moment we have registered a new clock event device. So the usage of the `clockevents` framework is simple and clear. Architectures register their clock event devices in the clock events core, and users of the clockevents core can get clock event devices for their use. The `clockevents` framework provides notification mechanisms for various clock related management events, like the registration or unregistration of a clock event device, a processor going offline in a system which supports [CPU hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) and so on.

We saw the implementation only of the `clockevents_register_device` function. But generally, the clock event layer [API](https://en.wikipedia.org/wiki/Application_programming_interface) is small. Besides the `API` for clock event device registration, the `clockevents` framework provides functions to schedule the next event interrupt, a clock event device notification service and support for suspend and resume of clock event devices.

If you want to know more about the `clockevents` API you can start by researching the following source code and header files: [kernel/time/tick-common.c](https://github.com/torvalds/linux/blob/master/kernel/time/tick-common.c), [kernel/time/clockevents.c](https://github.com/torvalds/linux/blob/master/kernel/time/clockevents.c) and [include/linux/clockchips.h](https://github.com/torvalds/linux/blob/master/include/linux/clockchips.h).

That's all.

Conclusion
-------------------------------------------------------------------------------

This is the end of the fifth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and timer management related stuff in the Linux kernel. In the previous part we got acquainted with the `timers` concept. In this part we continued to learn about time management related stuff in the Linux kernel and saw a little about yet another framework - `clockevents`.

If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
-------------------------------------------------------------------------------

* [timekeeping documentation](https://github.com/0xAX/linux/blob/master/Documentation/timers/timekeeping.txt)
* [Intel 8253](https://en.wikipedia.org/wiki/Intel_8253)
* [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer)
* [ACPI pdf](http://uefi.org/sites/default/files/resources/ACPI_5.pdf)
* [x86](https://en.wikipedia.org/wiki/X86)
* [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
* [powerpc](https://en.wikipedia.org/wiki/PowerPC)
* [frequency](https://en.wikipedia.org/wiki/Frequency)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond)
* [interrupt](https://en.wikipedia.org/wiki/Interrupt)
* [interrupt handler](https://en.wikipedia.org/wiki/Interrupt_handler)
* [local APIC](https://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller)
* [C3 state](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface#Device_states)
* [Periodic Interval Timer (PIT) for at91sam926x](http://www.atmel.com/Images/doc6062.pdf)
* [CPU masks in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
* [deadlock](https://en.wikipedia.org/wiki/Deadlock)
* [CPU hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-3.html)

Timers and time management in the Linux kernel. Part 6.
================================================================================

x86_64 related clock sources
--------------------------------------------------------------------------------

This is the sixth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-5.html) we saw the `clockevents` framework and now we will continue to dive into time management related stuff in the Linux kernel. This part will describe the implementation of [x86](https://en.wikipedia.org/wiki/X86) architecture related clock sources (you can read more about the `clocksource` concept in the [second part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) of this chapter).

First of all we must know which clock sources may be used on the `x86` architecture. It is easy to find out from [sysfs](https://en.wikipedia.org/wiki/Sysfs): the `/sys/devices/system/clocksource/clocksourceN` directory provides two special files for this:

* `available_clocksource` - provides information about the available clock sources in the system;
* `current_clocksource` - provides information about the currently used clock source in the system.

So, let's look:

```
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
```

We can see that there are three registered clock sources in my system:

* `tsc` - [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter);
* `hpet` - [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer);
* `acpi_pm` - [ACPI Power Management Timer](http://uefi.org/sites/default/files/resources/ACPI_5.pdf).

Now let's look at the second file, which shows the best clock source (the clock source with the best rating in the system):

```
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
```

For me it is the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter). As we may know from the [second part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-2.html) of this chapter, which describes the internals of the `clocksource` framework in the Linux kernel, the best clock source in a system is the clock source with the best (highest) rating, or in other words with the highest [frequency](https://en.wikipedia.org/wiki/Frequency).

The frequency of the [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) power management timer is `3.579545 MHz`. The frequency of the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) is at least `10 MHz`. And the frequency of the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) depends on the processor. For example, on older processors the `Time Stamp Counter` was counting internal processor clock cycles, which means its frequency changed when the processor's frequency scaling changed. The situation has changed for newer processors: they have an `invariant Time Stamp Counter` that increments at a constant rate in all operational states of the processor. Actually we can get its frequency in the output of `/proc/cpuinfo`. For example, for the first processor in the system:

```
$ cat /proc/cpuinfo
...
model name : Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
...
```

And although the Intel manual says that the frequency of the `Time Stamp Counter`, while constant, is not necessarily the maximum qualified frequency of the processor or the frequency given in the brand string, we may see that it is much higher than the frequency of the `ACPI PM` timer or of the `High Precision Event Timer`. And we can see that the clock source with the best rating, or highest frequency, is the current one in the system.
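
By the way, reading the `Time Stamp Counter` is so cheap that it can be done directly from userspace with the `rdtsc` instruction. Here is a small sketch (assuming an `x86_64` processor; the measured delta is just for illustration):

```C
#include <stdio.h>
#include <stdint.h>

/* read the Time Stamp Counter with the rdtsc instruction */
static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	/* rdtsc returns the low 32 bits in eax and the high 32 bits in edx */
	__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	uint64_t start = rdtsc();
	uint64_t end   = rdtsc();

	printf("two consecutive reads differ by %llu cycles\n",
	       (unsigned long long)(end - start));
	return 0;
}
```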

You may note that, besides these three clock sources, we don't see another two clock sources that are familiar to us in the output of `/sys/devices/system/clocksource/clocksource0/available_clocksource`. These clock sources are `jiffy` and `refined_jiffies`. We don't see them because this file shows only high resolution clock sources, or in other words clock sources with the [CLOCK_SOURCE_VALID_FOR_HRES](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h#L113) flag.

As I already wrote above, we will consider all three of these clock sources in this part. We will consider them in the order of their initialization:

* `hpet`;
* `acpi_pm`;
* `tsc`.

We can make sure that the order is exactly like this in the output of the [dmesg](https://en.wikipedia.org/wiki/Dmesg) util:

```
$ dmesg | grep clocksource
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
[    0.094369] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.186498] clocksource: Switched to clocksource hpet
[    0.196827] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    1.413685] tsc: Refined TSC clocksource calibration: 3999.981 MHz
[    1.413688] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x73509721780, max_idle_ns: 881591102108 ns
[    2.413748] clocksource: Switched to clocksource tsc
```

The first clock source is the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer), so let's start with it.

High Precision Event Timer
--------------------------------------------------------------------------------

The implementation of the `High Precision Event Timer` for the [x86](https://en.wikipedia.org/wiki/X86) architecture is located in the [arch/x86/kernel/hpet.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/hpet.c) source code file. Its initialization starts with the call of the `hpet_enable` function. This function is called during Linux kernel initialization. If we look into the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file, we will see that after all the architecture-specific stuff is initialized, the early console is disabled and the time management subsystem is ready, the following function is called:

```C
if (late_time_init)
	late_time_init();
```

which does the initialization of the late architecture-specific timers after the early jiffy counter is already initialized. The definition of the `late_time_init` function for the `x86` architecture is located in the [arch/x86/kernel/time.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/time.c) source code file. It looks pretty easy:

```C
static __init void x86_late_time_init(void)
{
	x86_init.timers.timer_init();
	tsc_init();
}
```

As we may see, it does the initialization of the `x86` related timer and the initialization of the `Time Stamp Counter`. The second one we will see in the next paragraph, but now let's consider the call of the `x86_init.timers.timer_init` function. `timer_init` points to the `hpet_time_init` function from the same source code file. We can verify this by looking at the definition of the `x86_init` structure from [arch/x86/kernel/x86_init.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/x86_init.c):

```C
struct x86_init_ops x86_init __initdata = {
	...
	...
	...
	.timers = {
		.setup_percpu_clockev	= setup_boot_APIC_clock,
		.timer_init		= hpet_time_init,
		.wallclock_init		= x86_init_noop,
	},
	...
	...
	...
```

The `hpet_time_init` function sets up the [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer) if we can not enable the `High Precision Event Timer`, and sets up the default timer [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) for the enabled timer:

```C
void __init hpet_time_init(void)
{
	if (!hpet_enable())
		setup_pit_timer();
	setup_default_timer_irq();
}
```

First of all the `hpet_enable` function checks whether we can enable the `High Precision Event Timer` in the system by the call of the `is_hpet_capable` function and, if we can, we map a virtual address space for it:

```C
int __init hpet_enable(void)
{
	if (!is_hpet_capable())
		return 0;

	hpet_set_mapping();
}
```

The `is_hpet_capable` function checks that we didn't pass `hpet=disable` on the kernel command line and that the `hpet_address` was received from the [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) `HPET` table. The `hpet_set_mapping` function just maps the virtual address space for the timer registers:

```C
hpet_virt_address = ioremap_nocache(hpet_address, HPET_MMAP_SIZE);
```

As we can read in the [IA-PC HPET (High Precision Event Timers) Specification](http://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/software-developers-hpet-spec-1-0a.pdf):

> The timer register space is 1024 bytes

So, the `HPET_MMAP_SIZE` is `1024` bytes too:

```C
#define HPET_MMAP_SIZE		1024
```

After we have mapped the virtual space for the `High Precision Event Timer`, we read the `HPET_ID` register to get the number of the timers:

```C
id = hpet_readl(HPET_ID);

last = (id & HPET_ID_NUMBER) >> HPET_ID_NUMBER_SHIFT;
```

We need this number to allocate the correct amount of space for the `General Configuration Register` of the `High Precision Event Timer`:

```C
cfg = hpet_readl(HPET_CFG);

hpet_boot_cfg = kmalloc((last + 2) * sizeof(*hpet_boot_cfg), GFP_KERNEL);
```

After the space is allocated for the configuration register of the `High Precision Event Timer`, we allow the main counter to run and allow timer interrupts if they are enabled, by setting the `HPET_CFG_ENABLE` bit in the configuration register for all timers. In the end we just register the new clock source by the call of the `hpet_clocksource_register` function:

```C
if (hpet_clocksource_register())
	goto out_nohpet;
```

which just calls the already familiar

```C
clocksource_register_hz(&clocksource_hpet, (u32)hpet_freq);
```

function. Here `clocksource_hpet` is a `clocksource` structure with a rating of `250` (remember, the rating of the previous `refined_jiffies` clock source was `2`), the name `hpet` and the `read_hpet` callback for reading the atomic counter provided by the `High Precision Event Timer`:

```C
static struct clocksource clocksource_hpet = {
	.name		= "hpet",
	.rating		= 250,
	.read		= read_hpet,
	.mask		= HPET_MASK,
	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
	.resume		= hpet_resume_counter,
	.archdata	= { .vclock_mode = VCLOCK_HPET },
};
```

After the `clocksource_hpet` is registered, we can return to the `hpet_time_init()` function from the [arch/x86/kernel/time.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/time.c) source code file. We may remember that the last step there is the call of the:

```C
setup_default_timer_irq();
```

function. The `setup_default_timer_irq` function checks the existence of `legacy` IRQs, or in other words support for the [i8259](https://en.wikipedia.org/wiki/Intel_8259), and sets up [IRQ0](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29#Master_PIC) depending on this.
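
At the time of this writing, this function looks approximately like the following (slightly shortened):

```C
static void __init setup_default_timer_irq(void)
{
	/* nothing to do if there are no legacy (i8259) IRQs in the system */
	if (!nr_legacy_irqs())
		return;
	/* install the timer interrupt handler on IRQ0 */
	setup_irq(0, &irq0);
}
```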

That's all. From this moment the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) clock source is registered in the Linux kernel `clocksource` framework and may be used from generic kernel code via the `read_hpet`:

```C
static cycle_t read_hpet(struct clocksource *cs)
{
	return (cycle_t)hpet_readl(HPET_COUNTER);
}
```

function, which just reads and returns the atomic counter from the `Main Counter Register`.
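
To see how such a `read` callback is consumed, here is a sketch (not actual kernel code) of how the timekeeping core uses a registered clock source: read the counter twice, mask the difference to handle counter wrap-around, and convert cycles to nanoseconds with the `mult`/`shift` pair we met earlier in this chapter:

```C
/* a sketch only: the real timekeeping code is more involved */
static u64 elapsed_ns(struct clocksource *cs)
{
	cycle_t start, end, delta;

	start = cs->read(cs);
	/* ... the interval we want to measure ... */
	end = cs->read(cs);

	/* masking handles wrap-around of counters narrower than 64 bits */
	delta = (end - start) & cs->mask;
	return ((u64)delta * cs->mult) >> cs->shift;
}
```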

ACPI PM timer
--------------------------------------------------------------------------------

The second clock source is the [ACPI Power Management Timer](http://uefi.org/sites/default/files/resources/ACPI_5.pdf). The implementation of this clock source is located in the [drivers/clocksource/acpi_pm.c](https://github.com/torvalds/linux/blob/master/drivers/clocksource/acpi_pm.c) source code file and starts with the call of the `init_acpi_pm_clocksource` function during the `fs` [initcall](http://www.compsoc.man.ac.uk/~moz/kernelnewbies/documents/initcall/kernel.html).

If we look at the implementation of the `init_acpi_pm_clocksource` function, we will see that it starts with a check of the value of the `pmtmr_ioport` variable:

```C
static int __init init_acpi_pm_clocksource(void)
{
	...
	...
	...
	if (!pmtmr_ioport)
		return -ENODEV;
	...
	...
	...
```

This `pmtmr_ioport` variable contains the extended address of the `Power Management Timer Control Register Block`. It gets its value in the `acpi_parse_fadt` function which is defined in the [arch/x86/kernel/acpi/boot.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/acpi/boot.c) source code file. This function parses the `FADT` or `Fixed ACPI Description Table` [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) table and tries to get the value of the `X_PM_TMR_BLK` field which contains the extended address of the `Power Management Timer Control Register Block`, represented in the `Generic Address Structure` format:

```C
static int __init acpi_parse_fadt(struct acpi_table_header *table)
{
#ifdef CONFIG_X86_PM_TIMER
	...
	...
	...
	pmtmr_ioport = acpi_gbl_FADT.xpm_timer_block.address;
	...
	...
	...
#endif
	return 0;
}
```

So, if the `CONFIG_X86_PM_TIMER` Linux kernel configuration option is disabled, or if something goes wrong in the `acpi_parse_fadt` function, we can't access the `Power Management Timer` register and we return from `init_acpi_pm_clocksource`. Otherwise, if the value of the `pmtmr_ioport` variable is not zero, we check the rate of this timer and register this clock source by the call of the:

```C
clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
```

function. After the call of `clocksource_register_hz`, the `acpi_pm` clock source will be registered in the `clocksource` framework of the Linux kernel:

```C
static struct clocksource clocksource_acpi_pm = {
	.name		= "acpi_pm",
	.rating		= 200,
	.read		= acpi_pm_read,
	.mask		= (cycle_t)ACPI_PM_MASK,
	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
};
```

with a rating of `200` and the `acpi_pm_read` callback to read the atomic counter provided by the `acpi_pm` clock source. The `acpi_pm_read` function just executes the `read_pmtmr` function:

```C
static cycle_t acpi_pm_read(struct clocksource *cs)
{
	return (cycle_t)read_pmtmr();
}
```

which reads the value of the `Power Management Timer` register. This register has the following structure:

```
+-------------------------------+----------------------------------+
|                               |                                  |
|  upper eight bits of a        |      running count of the        |
| 32-bit power management timer |      power management timer      |
|                               |                                  |
+-------------------------------+----------------------------------+
31          E_TMR_VAL           24               TMR_VAL           0
```

The address of this register is stored in the `Fixed ACPI Description Table` [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) table and we already have it in `pmtmr_ioport`. So, the implementation of the `read_pmtmr` function is pretty easy:

```C
static inline u32 read_pmtmr(void)
{
	return inl(pmtmr_ioport) & ACPI_PM_MASK;
}
```

We just read the value of the `Power Management Timer` register and mask it to its lower `24` bits.
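
Here the `ACPI_PM_MASK` macro limits the value to the `24` bits of `TMR_VAL`, and it is built from the already familiar `CLOCKSOURCE_MASK` macro from the [include/linux/clocksource.h](https://github.com/torvalds/linux/blob/master/include/linux/clocksource.h) header file:

```C
#define ACPI_PM_MASK CLOCKSOURCE_MASK(24) /* limit it to 24 bits */

#define CLOCKSOURCE_MASK(bits) (cycle_t)((bits) < 64 ? ((1ULL<<(bits))-1) : -1)
```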

That's all. Now we move to the last clock source in this part - the `Time Stamp Counter`.

Time Stamp Counter
--------------------------------------------------------------------------------

The third and last clock source in this part is the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) clock source and its implementation is located in the [arch/x86/kernel/tsc.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc.c) source code file. We already saw the `x86_late_time_init` function in this part, and the initialization of the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) starts from this place. This function calls the `tsc_init()` function from the [arch/x86/kernel/tsc.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc.c) source code file.

At the beginning of the `tsc_init` function we can see a check which tests whether the processor supports the `Time Stamp Counter`:

```C
void __init tsc_init(void)
{
	u64 lpj;
	int cpu;

	if (!cpu_has_tsc) {
		setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
		return;
	}
	...
	...
	...
```

The `cpu_has_tsc` macro expands to the call of the `cpu_has` macro:

```C
#define cpu_has_tsc		boot_cpu_has(X86_FEATURE_TSC)

#define boot_cpu_has(bit)	cpu_has(&boot_cpu_data, bit)

#define cpu_has(c, bit)							\
	(__builtin_constant_p(bit) && REQUIRED_MASK_BIT_SET(bit) ? 1 :	\
	 test_cpu_cap(c, bit))
```

which checks the given bit (`X86_FEATURE_TSC` in our case) in the `boot_cpu_data` structure, which is filled during early Linux kernel initialization. If the processor supports the `Time Stamp Counter`, we get its frequency by the call of the `calibrate_tsc` function from the same source code file, which tries to get the frequency from different sources like a [Model Specific Register](https://en.wikipedia.org/wiki/Model-specific_register), calibration over the [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer) and so on. After this we initialize the frequency and scale factor for all processors in the system:

```C
tsc_khz = x86_platform.calibrate_tsc();
cpu_khz = tsc_khz;

for_each_possible_cpu(cpu) {
	cyc2ns_init(cpu);
	set_cyc2ns_scale(cpu_khz, cpu);
}
```

because only the first, bootstrap, processor will invoke `tsc_init`. After this we check that the `Time Stamp Counter` is not disabled:

```
if (tsc_disabled > 0)
	return;
...
...
...
check_system_tsc_reliable();
```

and call the `check_system_tsc_reliable` function, which sets the `tsc_clocksource_reliable` flag if the bootstrap processor has the `X86_FEATURE_TSC_RELIABLE` feature. Note that we went through the `tsc_init` function, but did not register our clock source. The actual registration of the `Time Stamp Counter` clock source occurs in the:

```C
static int __init init_tsc_clocksource(void)
{
	if (!cpu_has_tsc || tsc_disabled > 0 || !tsc_khz)
		return 0;
	...
	...
	...
	if (boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) {
		clocksource_register_khz(&clocksource_tsc, tsc_khz);
		return 0;
	}
```

function. This function is called during the `device` [initcall](http://www.compsoc.man.ac.uk/~moz/kernelnewbies/documents/initcall/kernel.html). We do this to be sure that the `Time Stamp Counter` clock source will be registered after the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) clock source.

After this, all three clock sources will be registered in the `clocksource` framework and the `Time Stamp Counter` clock source will be selected as the active one, because it has the highest rating among the other clock sources:

```C
static struct clocksource clocksource_tsc = {
	.name		= "tsc",
	.rating		= 300,
	.read		= read_tsc,
	.mask		= CLOCKSOURCE_MASK(64),
	.flags		= CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_MUST_VERIFY,
	.archdata	= { .vclock_mode = VCLOCK_TSC },
};
```
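
By the way, the `current_clocksource` file we saw at the beginning of this part is writable, so we may ask the kernel to switch the active clock source at runtime to any entry from `available_clocksource` (shown here on my machine; your output may differ):

```
$ echo hpet | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
```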

That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the sixth part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and timer management related stuff in the Linux kernel. In the previous part we got acquainted with the `clockevents` framework. In this part we continued to learn about time management related stuff in the Linux kernel and saw a little about three different clock sources which are used on the [x86](https://en.wikipedia.org/wiki/X86) architecture. The next part will be the last part of this [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) and we will see some user space related stuff, i.e. how some time related [system calls](https://en.wikipedia.org/wiki/System_call) are implemented in the Linux kernel.

If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [x86](https://en.wikipedia.org/wiki/X86)
* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
* [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
* [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
* [ACPI Power Management Timer (PDF)](http://uefi.org/sites/default/files/resources/ACPI_5.pdf)
* [frequency](https://en.wikipedia.org/wiki/Frequency)
* [dmesg](https://en.wikipedia.org/wiki/Dmesg)
* [programmable interval timer](https://en.wikipedia.org/wiki/Programmable_interval_timer)
* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [IA-PC HPET (High Precision Event Timers) Specification](http://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/software-developers-hpet-spec-1-0a.pdf)
* [IRQ0](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29#Master_PIC)
* [i8259](https://en.wikipedia.org/wiki/Intel_8259)
* [initcall](http://www.compsoc.man.ac.uk/~moz/kernelnewbies/documents/initcall/kernel.html)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-5.html)

Timers and time management in the Linux kernel. Part 7.
================================================================================

Time related system calls in the Linux kernel
--------------------------------------------------------------------------------

This is the seventh and last part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) which describes timers and time management related stuff in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-6.html) we saw some [x86_64](https://en.wikipedia.org/wiki/X86-64) related clock sources like the [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer) and the [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter). Internal time management is an interesting part of the Linux kernel, but of course not only the kernel needs the `time` concept. Our programs need to know the time too. In this part, we will consider the implementation of some time management related [system calls](https://en.wikipedia.org/wiki/System_call). These system calls are:

* `clock_gettime`;
* `gettimeofday`;
* `nanosleep`.

We will start from a simple userspace [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) program and trace the whole way from the call of the [standard library](https://en.wikipedia.org/wiki/Standard_library) function to the implementation of the corresponding system call. As each [architecture](https://github.com/torvalds/linux/tree/master/arch) provides its own implementation of a certain system call, we will consider only the [x86_64](https://en.wikipedia.org/wiki/X86-64) specific implementations of these system calls, as this book is related to this architecture.

Additionally, we will not consider the concept of system calls in this part, but only the implementations of these three system calls in the Linux kernel. If you are interested in what a `system call` is, there is a special [chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) about this.

So, let's start with the `gettimeofday` system call.

Implementation of the `gettimeofday` system call
--------------------------------------------------------------------------------

As we can understand from the name `gettimeofday`, this function returns the current time. First of all, let's look at the following simple example:

```C
#include <time.h>
#include <sys/time.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	char buffer[40];
	struct timeval time;

	gettimeofday(&time, NULL);

	strftime(buffer, 40, "Current date/time: %m-%d-%Y/%T", localtime(&time.tv_sec));
	printf("%s\n", buffer);

	return 0;
}
```

As you can see, here we call the `gettimeofday` function which takes two parameters. The first is a pointer to a `timeval` structure, which represents an elapsed time:

```C
struct timeval {
	time_t      tv_sec;     /* seconds */
	suseconds_t tv_usec;    /* microseconds */
};
```

The second parameter of the `gettimeofday` function is a pointer to a `timezone` structure which represents a timezone. In our example, we pass the address of `timeval time` to the `gettimeofday` function; the Linux kernel fills the given `timeval` structure and returns it back to us. Additionally, we format the time with the `strftime` function to get something more human readable than a raw number of elapsed seconds. Let's see the result:

```
~$ gcc date.c -o date
~$ ./date
Current date/time: 03-26-2016/16:42:02
```

As you may already know, a userspace application does not call a system call directly in kernel space. Before the actual system call entry is reached, we call a function from the standard library. In my case it is [glibc](https://en.wikipedia.org/wiki/GNU_C_Library), so I will consider this case. The implementation of the `gettimeofday` function is located in the [sysdeps/unix/sysv/linux/x86/gettimeofday.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86/gettimeofday.c;h=36f7c26ffb0e818709d032c605fec8c4bd22a14e;hb=HEAD) source code file. As you may already know, `gettimeofday` is not a usual system call. It is located in the special area which is called the `vDSO` (you can read more about it in the [part](https://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html) which describes this concept).

The `glibc` implementation of `gettimeofday` tries to resolve the given symbol, in our case `__vdso_gettimeofday`, by the call of the `_dl_vdso_vsym` internal function. If the symbol cannot be resolved, it returns `NULL` and we fallback to the call of the usual system call:

```C
return (_dl_vdso_vsym ("__vdso_gettimeofday", &linux26)
        ?: (void*) (&__gettimeofday_syscall));
```

The `gettimeofday` entry is located in the [arch/x86/entry/vdso/vclock_gettime.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vclock_gettime.c) source code file. As we can see, `gettimeofday` is a weak alias of `__vdso_gettimeofday`:

```C
int gettimeofday(struct timeval *, struct timezone *)
	__attribute__((weak, alias("__vdso_gettimeofday")));
```

The `__vdso_gettimeofday` is defined in the same source code file and calls the `do_realtime` function if the given `timeval` is not null:

```C
notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
{
	if (likely(tv != NULL)) {
		if (unlikely(do_realtime((struct timespec *)tv) == VCLOCK_NONE))
			return vdso_fallback_gtod(tv, tz);
		tv->tv_usec /= 1000;
	}
	if (unlikely(tz != NULL)) {
		tz->tz_minuteswest = gtod->tz_minuteswest;
		tz->tz_dsttime = gtod->tz_dsttime;
	}

	return 0;
}
```
|
||||
|
||||
If the `do_realtime` will fail, we fallback to the real system call via call the `syscall` instruction and passing the `__NR_gettimeofday` system call number and the given `timeval` and `timezone`:
|
||||
|
||||
```C
|
||||
notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
|
||||
{
|
||||
long ret;
|
||||
|
||||
asm("syscall" : "=a" (ret) :
|
||||
"0" (__NR_gettimeofday), "D" (tv), "S" (tz) : "memory");
|
||||
return ret;
|
||||
}
|
||||
```
|
||||
|
||||
The `do_realtime` function gets the time data from the `vsyscall_gtod_data` structure which is defined in the [arch/x86/include/asm/vgtod.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/vgtod.h#L16) header file and contains mapping of the `timespec` structure and a couple of fields which are related to the current clock source in the system. This function fills the given `timeval` structure with values from the `vsyscall_gtod_data` which contains a time related data which is updated via timer interrupt.
|
||||
|
||||
First of all we try to access the `gtod` or `global time of day` the `vsyscall_gtod_data` structure via the call of the `gtod_read_begin` and will continue to do it until it will be successful:
|
||||
|
||||
```C
|
||||
do {
|
||||
seq = gtod_read_begin(gtod);
|
||||
mode = gtod->vclock_mode;
|
||||
ts->tv_sec = gtod->wall_time_sec;
|
||||
ns = gtod->wall_time_snsec;
|
||||
ns += vgetsns(&mode);
|
||||
ns >>= gtod->shift;
|
||||
} while (unlikely(gtod_read_retry(gtod, seq)));
|
||||
|
||||
ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
|
||||
ts->tv_nsec = ns;
|
||||
```
|
||||
|
||||
As we got access to the `gtod`, we fill the `ts->tv_sec` with the `gtod->wall_time_sec` which stores current time in seconds gotten from the [real time clock](https://en.wikipedia.org/wiki/Real-time_clock) during initialization of the timekeeping subsystem in the Linux kernel and the same value but in nanoseconds. In the end of this code we just fill the given `timespec` structure with the resulted values.
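
The `__iter_div_u64_rem` call at the end normalizes the accumulated nanoseconds: whole seconds are moved into `tv_sec` and the remainder, which always stays below `NSEC_PER_SEC`, becomes `tv_nsec`. Here is a minimal userspace sketch of that normalization (plain C, not the kernel helper itself; the iterative subtraction is cheap because `ns` is at most a few seconds):

```C
#include <stdio.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

int main(void)
{
    uint64_t ns  = 3500000000ULL; /* 3.5 seconds worth of nanoseconds */
    uint64_t sec = 0;

    /* the same effect as sec = __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns) */
    while (ns >= NSEC_PER_SEC) {
        ns -= NSEC_PER_SEC;
        ++sec;
    }

    printf("tv_sec += %llu, tv_nsec = %llu\n",
           (unsigned long long)sec, (unsigned long long)ns);
    return 0;
}
```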

That's all about the `gettimeofday` system call. The next system call in our list is `clock_gettime`.

Implementation of the `clock_gettime` system call
--------------------------------------------------------------------------------

The `clock_gettime` function gets the time of the clock which is specified by the first parameter. Generally the `clock_gettime` function takes two parameters:

* `clk_id` - clock identifier;
* `timespec` - address of the `timespec` structure which represents elapsed time.

Let's look at the following simple example:

```C
#include <time.h>
#include <sys/time.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    struct timespec elapsed_from_boot;

    clock_gettime(CLOCK_BOOTTIME, &elapsed_from_boot);

    printf("%ld - seconds elapsed from boot\n", (long)elapsed_from_boot.tv_sec);

    return 0;
}
```

which prints `uptime` information:

```
~$ gcc uptime.c -o uptime
~$ ./uptime
14180 - seconds elapsed from boot
```

We can easily check the result with the help of the [uptime](https://en.wikipedia.org/wiki/Uptime#Using_uptime) util:

```
~$ uptime
up 3:56
```

The `elapsed_from_boot.tv_sec` represents elapsed time in seconds, so:

```python
>>> 14180 // 60
236
>>> 14180 // 60 // 60
3
>>> 14180 // 60 % 60
56
```

The `clk_id` may be one of the following:

* `CLOCK_REALTIME` - system wide clock which measures real or wall-clock time;
* `CLOCK_REALTIME_COARSE` - faster but less precise version of the `CLOCK_REALTIME`;
* `CLOCK_MONOTONIC` - represents monotonic time since some unspecified starting point;
* `CLOCK_MONOTONIC_COARSE` - faster but less precise version of the `CLOCK_MONOTONIC`;
* `CLOCK_MONOTONIC_RAW` - the same as the `CLOCK_MONOTONIC` but provides non [NTP](https://en.wikipedia.org/wiki/Network_Time_Protocol) adjusted time;
* `CLOCK_BOOTTIME` - the same as the `CLOCK_MONOTONIC` but additionally includes time that the system was suspended;
* `CLOCK_PROCESS_CPUTIME_ID` - per-process time consumed by all threads in the process;
* `CLOCK_THREAD_CPUTIME_ID` - thread-specific clock.
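
To feel the difference between these clocks in practice, here is a small sketch that uses `CLOCK_MONOTONIC` to measure an interval - the usual choice for measuring durations, since it is not affected by wall-clock adjustments (ordinary userspace C; note that older glibc versions may require linking with `-lrt`):

```C
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end, pause = { 0, 100000000 }; /* ~100 ms */

    clock_gettime(CLOCK_MONOTONIC, &start);
    nanosleep(&pause, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    /* a monotonic clock cannot go backwards, so this is always >= 0 */
    long elapsed_ns = (end.tv_sec - start.tv_sec) * 1000000000L +
                      (end.tv_nsec - start.tv_nsec);

    printf("elapsed: %ld ns\n", elapsed_ns);
    return 0;
}
```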

The `clock_gettime` is not a usual syscall either; like `gettimeofday`, this system call is placed in the `vDSO` area. The entry of this system call is located in the same source code file - [arch/x86/entry/vdso/vclock_gettime.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vclock_gettime.c) - as `gettimeofday`.

The implementation of the `clock_gettime` depends on the clock id. If we have passed the `CLOCK_REALTIME` clock id, the `do_realtime` function will be called:

```C
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
{
    switch (clock) {
    case CLOCK_REALTIME:
        if (do_realtime(ts) == VCLOCK_NONE)
            goto fallback;
        break;
    ...
    ...
    ...
fallback:
    return vdso_fallback_gettime(clock, ts);
}
```

In other cases, the `do_{name_of_clock_id}` function is called. Implementations of some of them are similar. For example, if we pass the `CLOCK_MONOTONIC` clock id:

```C
...
...
...
case CLOCK_MONOTONIC:
    if (do_monotonic(ts) == VCLOCK_NONE)
        goto fallback;
    break;
...
...
...
```

the `do_monotonic` function will be called, which is very similar to the implementation of `do_realtime`:

```C
notrace static int __always_inline do_monotonic(struct timespec *ts)
{
    unsigned long seq;
    u64 ns;
    int mode;

    do {
        seq = gtod_read_begin(gtod);
        mode = gtod->vclock_mode;
        ts->tv_sec = gtod->monotonic_time_sec;
        ns = gtod->monotonic_time_snsec;
        ns += vgetsns(&mode);
        ns >>= gtod->shift;
    } while (unlikely(gtod_read_retry(gtod, seq)));

    ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
    ts->tv_nsec = ns;

    return mode;
}
```

We already saw a little about the implementation of this function in the previous paragraph about `gettimeofday`. The only difference here is that the `sec` and `nsec` of our `timespec` value will be based on `gtod->monotonic_time_sec` instead of `gtod->wall_time_sec`, which maps the value of `tk->tkr_mono.xtime_nsec`, or the number of [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond) elapsed.

That's all.

Implementation of the `nanosleep` system call
--------------------------------------------------------------------------------

The last system call in our list is `nanosleep`. As you can understand from its name, this function provides `sleeping` ability. Let's look at the following simple example:

```C
#include <time.h>
#include <stdlib.h>
#include <stdio.h>

int main (void)
{
    struct timespec ts = {5, 0};

    printf("sleep five seconds\n");
    nanosleep(&ts, NULL);
    printf("end of sleep\n");

    return 0;
}
```

If we compile and run it, we will see the first line

```
~$ gcc sleep_test.c -o sleep
~$ ./sleep
sleep five seconds
end of sleep
```

and the second line after five seconds.

The `nanosleep` is not located in the `vDSO` area like the `gettimeofday` and the `clock_gettime` functions. So, let's look at how the `real` system call, which is located in kernel space, will be called by the standard library. The implementation of the `nanosleep` system call will be invoked with the help of the [syscall](http://www.felixcloutier.com/x86/SYSCALL.html) instruction. Before the execution of the `syscall` instruction, the parameters of the system call must be put in processor [registers](https://en.wikipedia.org/wiki/Processor_register) according to the order which is described in the [System V Application Binary Interface](http://www.x86-64.org/documentation/abi.pdf), or in other words:

* `rdi` - first parameter;
* `rsi` - second parameter;
* `rdx` - third parameter;
* `r10` - fourth parameter;
* `r8` - fifth parameter;
* `r9` - sixth parameter.

The `nanosleep` system call has two parameters - two pointers to `timespec` structures. The system call suspends the calling thread until the given timeout has elapsed; additionally it will finish if a signal interrupts its execution. The first parameter is the `timespec` which represents the timeout for the sleep. The second parameter is a pointer to a `timespec` structure too; it contains the remainder of the time if the call of the `nanosleep` was interrupted.
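
This remainder makes it possible to restart an interrupted sleep. A sketch of the usual userspace pattern (ordinary C, not kernel code): if a signal interrupts the sleep, `nanosleep` returns `-1` with `errno` set to `EINTR` and the remaining time in `rem`, so the call can simply be restarted:

```C
#include <errno.h>
#include <time.h>

static int sleep_full(struct timespec req)
{
    struct timespec rem;

    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;
        req = rem;  /* continue with the time that is left */
    }
    return 0;
}
```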

So, `nanosleep` has the following prototype:

```C
int nanosleep(const struct timespec *req, struct timespec *rem);
```
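
Putting the register convention from the list above together with this prototype, a raw call of `nanosleep` would look like the following sketch (x86_64, GNU C inline assembly, for illustration only; the `raw_nanosleep` name is ours - real code should just use the libc wrapper):

```C
#include <stdio.h>
#include <time.h>
#include <sys/syscall.h>

static long raw_nanosleep(const struct timespec *req, struct timespec *rem)
{
    long ret;

    /* rax = system call number, rdi = req, rsi = rem */
    asm volatile ("syscall"
                  : "=a" (ret)
                  : "0" (__NR_nanosleep), "D" (req), "S" (rem)
                  : "rcx", "r11", "memory");
    return ret;
}

int main(void)
{
    struct timespec ts = {1, 0};

    printf("sleeping one second via a raw syscall\n");
    raw_nanosleep(&ts, NULL);
    printf("done\n");
    return 0;
}
```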

Inside glibc, putting the `req` parameter into the `rdi` register and the `rem` parameter into the `rsi` register is the job of the `INTERNAL_SYSCALL` macro, which is located in the [sysdeps/unix/sysv/linux/x86_64/sysdep.h](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/sysdep.h;h=d023d68174d3dfb4e698160b31ae31ad291802e1;hb=HEAD) header file:

```C
# define INTERNAL_SYSCALL(name, err, nr, args...) \
	INTERNAL_SYSCALL_NCS (__NR_##name, err, nr, ##args)
```

This macro takes the name of the system call, storage for a possible error during the execution of the system call, the number of the system call (all `x86_64` system calls may be found in the [system calls table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl)) and the arguments of the system call. The `INTERNAL_SYSCALL` macro just expands to the call of the `INTERNAL_SYSCALL_NCS` macro, which prepares the arguments of the system call (puts them into the processor registers in the correct order), executes the `syscall` instruction and returns the result:

```C
# define INTERNAL_SYSCALL_NCS(name, err, nr, args...) \
  ({									      \
    unsigned long int resultvar;					      \
    LOAD_ARGS_##nr (args)						      \
    LOAD_REGS_##nr							      \
    asm volatile (							      \
    "syscall\n\t"							      \
    : "=a" (resultvar)							      \
    : "0" (name) ASM_ARGS_##nr : "memory", REGISTERS_CLOBBERED_BY_SYSCALL);   \
    (long int) resultvar; })
```

The `LOAD_ARGS_##nr` macro calls the `LOAD_ARGS_N` macro where the `N` is the number of arguments of the system call. In our case, it will be the `LOAD_ARGS_2` macro. Ultimately all of these macros will be expanded to the following:

```C
# define LOAD_REGS_TYPES_1(t1, a1)					   \
	register t1 _a1 asm ("rdi") = __arg1;				   \
	LOAD_REGS_0

# define LOAD_REGS_TYPES_2(t1, a1, t2, a2)				   \
	register t2 _a2 asm ("rsi") = __arg2;				   \
	LOAD_REGS_TYPES_1(t1, a1)
...
...
...
```

After the `syscall` instruction is executed, a [context switch](https://en.wikipedia.org/wiki/Context_switch) to kernel mode occurs and the kernel transfers execution to the system call handler. The system call handler for the `nanosleep` system call is located in the [kernel/time/hrtimer.c](https://github.com/torvalds/linux/blob/master/kernel/time/hrtimer.c) source code file and is defined with the `SYSCALL_DEFINE2` macro helper:

```C
SYSCALL_DEFINE2(nanosleep, struct timespec __user *, rqtp,
		struct timespec __user *, rmtp)
{
	struct timespec tu;

	if (copy_from_user(&tu, rqtp, sizeof(tu)))
		return -EFAULT;

	if (!timespec_valid(&tu))
		return -EINVAL;

	return hrtimer_nanosleep(&tu, rmtp, HRTIMER_MODE_REL, CLOCK_MONOTONIC);
}
```

More about the `SYSCALL_DEFINE2` macro you may read in the [chapter](https://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) about system calls. If we look at the implementation of the `nanosleep` system call, first of all we will see that it starts from the call of the `copy_from_user` function. This function copies the given data from userspace to kernelspace. In our case we copy the timeout value to a kernelspace `timespec` structure and then check that the given `timespec` is valid by calling the `timespec_valid` function:

```C
static inline bool timespec_valid(const struct timespec *ts)
{
	if (ts->tv_sec < 0)
		return false;
	if ((unsigned long)ts->tv_nsec >= NSEC_PER_SEC)
		return false;
	return true;
}
```

which just checks that the given `timespec` does not represent a date before `1970` and that the nanoseconds value does not overflow `1` second. The `nanosleep` function ends with the call of the `hrtimer_nanosleep` function from the same source code file. The `hrtimer_nanosleep` function creates a [timer](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-4.html) and calls the `do_nanosleep` function. The `do_nanosleep` function does the main job for us. It contains the following loop:

```C
do {
	set_current_state(TASK_INTERRUPTIBLE);
	hrtimer_start_expires(&t->timer, mode);

	if (likely(t->task))
		freezable_schedule();

} while (t->task && !signal_pending(current));

__set_current_state(TASK_RUNNING);
return t->task == NULL;
```

which freezes the current task during the sleep. After we set the `TASK_INTERRUPTIBLE` flag for the current task, the `hrtimer_start_expires` function starts the given high-resolution timer on the current processor. When the given high-resolution timer expires, the task becomes runnable again.

That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the seventh part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/Timers/index.html) that describes timers and time management related stuff in the Linux kernel. In the previous part we saw [x86_64](https://en.wikipedia.org/wiki/X86-64) specific clock sources. As I wrote in the beginning, this part is the last part of this chapter. We saw important time management related concepts like the `clocksource` and `clockevents` frameworks, the `jiffies` counter and so on in this chapter. Of course this does not cover all of the time management in the Linux kernel. Many parts of it are mostly related to scheduling, which we will see in another chapter.

If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [system call](https://en.wikipedia.org/wiki/System_call)
* [C programming language](https://en.wikipedia.org/wiki/C_%28programming_language%29)
* [standard library](https://en.wikipedia.org/wiki/Standard_library)
* [glibc](https://en.wikipedia.org/wiki/GNU_C_Library)
* [real time clock](https://en.wikipedia.org/wiki/Real-time_clock)
* [NTP](https://en.wikipedia.org/wiki/Network_Time_Protocol)
* [nanoseconds](https://en.wikipedia.org/wiki/Nanosecond)
* [register](https://en.wikipedia.org/wiki/Processor_register)
* [System V Application Binary Interface](http://www.x86-64.org/documentation/abi.pdf)
* [context switch](https://en.wikipedia.org/wiki/Context_switch)
* [Introduction to timers in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-4.html)
* [uptime](https://en.wikipedia.org/wiki/Uptime#Using_uptime)
* [system calls table for x86_64](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl)
* [High Precision Event Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer)
* [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [previous part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-6.html)

@ -1,14 +1,14 @@

# Interrupts and Interrupt Handling

You will find a couple of posts which describe interrupts and exceptions handling in the linux kernel.
In the following posts, we will cover interrupts and exceptions handling in the linux kernel.

* [Interrupts and Interrupt Handling. Part 1.](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-1.md) - describes an interrupts handling theory.
* [Start to dive into interrupts in the Linux kernel](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-2.md) - this part starts to describe interrupts and exceptions handling related stuff from the early stage.
* [Early interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-3.md) - third part describes early interrupt handlers.
* [Interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-4.md) - fourth part describes first non-early interrupt handlers.
* [Implementation of exception handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-5.md) - descripbes implementation of some exception handlers as double fault, divide by zero and etc.
* [Handling Non-Maskable interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-6.md) - describes handling of non-maskable interrupts and the rest of interrupts handlers from the architecture-specific part.
* [Dive into external hardware interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-7.md) - this part describes early initialization of code which is related to handling of external hardware interrupts.
* [Non-early initialization of the IRQs](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-8.md) - this part describes non-early initialization of code which is related to handling of external hardware interrupts.
* [Softirq, Tasklets and Workqueues](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-9.md) - this part describes softirqs, tasklets and workqueues concepts.
* [](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-10.md) - this is the last part of the interrupts and interrupt handling chapter and here we will see a real hardware driver and interrupts related stuff.
* [Interrupts and Interrupt Handling. Part 1.](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-1.md) - describes interrupts and interrupt handling theory.
* [Interrupts in the Linux Kernel](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-2.md) - describes stuffs related to interrupts and exceptions handling from the early stage.
* [Early interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-3.md) - describes early interrupt handlers.
* [Interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-4.md) - describes first non-early interrupt handlers.
* [Implementation of exception handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-5.md) - describes implementation of some exception handlers such as double fault, divide by zero etc.
* [Handling non-maskable interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-6.md) - describes handling of non-maskable interrupts and remaining interrupt handlers from the architecture-specific part.
* [External hardware interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-7.md) - describes early initialization of code which is related to handling external hardware interrupts.
* [Non-early initialization of the IRQs](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-8.md) - describes non-early initialization of code which is related to handling external hardware interrupts.
* [Softirq, Tasklets and Workqueues](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-9.md) - describes softirqs, tasklets and workqueues concepts.
* [](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-10.md) - this is the last part of the `Interrupts and Interrupt Handling` chapter and here we will see a real hardware driver and some interrupts related stuff.

@ -0,0 +1,434 @@

Linux kernel memory management Part 3.
================================================================================

Introduction to the kmemcheck in the Linux kernel
--------------------------------------------------------------------------------

This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/mm/) which describes [memory management](https://en.wikipedia.org/wiki/Memory_management) in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) of this chapter we met two memory management related concepts:

* `Fix-Mapped Addresses`;
* `ioremap`.

The first concept represents a special area in [virtual memory](https://en.wikipedia.org/wiki/Virtual_memory), whose corresponding physical mapping is calculated at [compile-time](https://en.wikipedia.org/wiki/Compile_time). The second concept provides the ability to map input/output related memory into virtual memory.

For example, if you look at the output of `/proc/iomem`:

```
$ sudo cat /proc/iomem

00000000-00000fff : reserved
00001000-0009d7ff : System RAM
0009d800-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000cffff : Video ROM
000d0000-000d3fff : PCI Bus 0000:00
000d4000-000d7fff : PCI Bus 0000:00
000d8000-000dbfff : PCI Bus 0000:00
000dc000-000dffff : PCI Bus 0000:00
000e0000-000fffff : reserved
...
...
...
```

you will see a map of the system's memory for each physical device. Here the first column displays the memory range used by each of the different types of memory; the second column lists the kind of memory located within that range. Or for example:

```
$ sudo cat /proc/ioports

0000-0cf7 : PCI Bus 0000:00
0000-001f : dma1
0020-0021 : pic1
0040-0043 : timer0
0050-0053 : timer1
0060-0060 : keyboard
0064-0064 : keyboard
0070-0077 : rtc0
0080-008f : dma page reg
00a0-00a1 : pic2
00c0-00df : dma2
00f0-00ff : fpu
00f0-00f0 : PNP0C04:00
03c0-03df : vga+
03f8-03ff : serial
04d0-04d1 : pnp 00:06
0800-087f : pnp 00:01
0a00-0a0f : pnp 00:04
0a20-0a2f : pnp 00:04
0a30-0a3f : pnp 00:04
...
...
...
```

shows the list of currently registered port regions used for input or output communication with a device. These memory-mapped I/O addresses are not used by the kernel directly. So, before the Linux kernel can use such memory, it must map it into the virtual memory space, which is the main purpose of the `ioremap` mechanism. Note that we saw only the early `ioremap` in the previous [part](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html). Soon we will look at the implementation of the non-early `ioremap` function. But before this we must learn other things, like the different types of memory allocators, because otherwise it would be very difficult to understand.
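
As a preview of that mechanism, here is a hedged sketch of how a driver typically uses `ioremap` (kernel-module context assumed; the physical base address, the region size and the `map_device_registers` helper name are hypothetical, and error handling is trimmed):

```C
#include <linux/errno.h>
#include <linux/io.h>
#include <linux/kernel.h>

#define DEV_PHYS_BASE   0xfebf0000UL /* hypothetical MMIO region */
#define DEV_REGION_SIZE 0x1000UL

static void __iomem *regs;

static int map_device_registers(void)
{
	/* map the physical I/O region into the kernel's virtual space */
	regs = ioremap(DEV_PHYS_BASE, DEV_REGION_SIZE);
	if (!regs)
		return -ENOMEM;

	/* MMIO must be accessed through helpers like readl() and writel() */
	pr_info("first register: %#x\n", readl(regs));
	return 0;
}
```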

So, before we move on to the non-early [memory management](https://en.wikipedia.org/wiki/Memory_management) of the Linux kernel, we will look at some mechanisms which provide special abilities for [debugging](https://en.wikipedia.org/wiki/Debugging), for checking for [memory leaks](https://en.wikipedia.org/wiki/Memory_leak), for memory control and so on. It will be easier to understand how memory management is arranged in the Linux kernel after learning all of these things.

As you may have guessed from the title of this part, we will start to consider memory mechanisms from [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt). As we always did in other [chapters](https://0xax.gitbooks.io/linux-insides/content/), we will start from the theoretical side and learn what the `kmemcheck` mechanism is in general, and only after this will we see how it is implemented in the Linux kernel.

So let's start. What is `kmemcheck` in the Linux kernel? As you may guess from the name of this mechanism, the `kmemcheck` checks memory. The main point of the `kmemcheck` mechanism is to check whether some kernel code accesses `uninitialized memory`. Let's take the following simple [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) program:

```C
#include <stdlib.h>
#include <stdio.h>

struct A {
	int a;
};

int main(int argc, char **argv) {
	struct A *a = malloc(sizeof(struct A));
	printf("a->a = %d\n", a->a);
	return 0;
}
```

Here we allocate memory for the `A` structure and try to print the value of the `a` field. If we compile this program without additional options:

```
gcc test.c -o test
```

the [compiler](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) will not warn us that the `a` field is not initialized. But if we run this program with the [valgrind](https://en.wikipedia.org/wiki/Valgrind) tool, we will see the following output:

```
~$ valgrind --leak-check=yes ./test
==28469== Memcheck, a memory error detector
==28469== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==28469== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==28469== Command: ./test
==28469==
==28469== Conditional jump or move depends on uninitialised value(s)
==28469==    at 0x4E820EA: vfprintf (in /usr/lib64/libc-2.22.so)
==28469==    by 0x4E88D48: printf (in /usr/lib64/libc-2.22.so)
==28469==    by 0x4005B9: main (in /home/alex/test)
==28469==
==28469== Use of uninitialised value of size 8
==28469==    at 0x4E7E0BB: _itoa_word (in /usr/lib64/libc-2.22.so)
==28469==    by 0x4E8262F: vfprintf (in /usr/lib64/libc-2.22.so)
==28469==    by 0x4E88D48: printf (in /usr/lib64/libc-2.22.so)
==28469==    by 0x4005B9: main (in /home/alex/test)
...
...
...
```

Actually the `kmemcheck` mechanism does the same thing for the kernel that `valgrind` does for userspace programs: it checks for use of uninitialized memory.
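
In kernel terms, a hypothetical snippet of the kind of bug `kmemcheck` is designed to catch might look like this (the `my_struct` type and the `buggy_read` function are made up for illustration):

```C
#include <linux/errno.h>
#include <linux/slab.h>

struct my_struct {
	int a;
};

static int buggy_read(void)
{
	struct my_struct *m = kmalloc(sizeof(*m), GFP_KERNEL);

	if (!m)
		return -ENOMEM;

	/* m->a was never written - kmemcheck would report this read */
	return m->a;
}
```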

To enable this mechanism in the Linux kernel, you need to enable the `CONFIG_KMEMCHECK` kernel configuration option in the:

```
Kernel hacking
  -> Memory Debugging
```

menu of the Linux kernel configuration:

![kernel configuration menu](http://oi63.tinypic.com/2pzbog7.jpg)

Besides enabling support for the `kmemcheck` mechanism in the Linux kernel, it also provides some configuration options for us. We will see all of these options in the next paragraph of this part. One last note before we consider how `kmemcheck` checks memory: currently this mechanism is implemented only for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture. You can verify this if you look in the [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig) `x86` related kernel configuration file, where you will see the following lines:

```
config X86
    ...
    ...
    ...
    select HAVE_ARCH_KMEMCHECK
    ...
    ...
    ...
```

So, there is nothing specific to other architectures.

Ok, so we know that `kmemcheck` provides a mechanism to check usage of `uninitialized memory` in the Linux kernel, and we know how to enable it. How does it do these checks? When the Linux kernel tries to allocate some memory, i.e. something like this is called:

```C
struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL);
```

and somebody later wants to access the allocated [page](https://en.wikipedia.org/wiki/Page_%28computer_memory%29), a [page fault](https://en.wikipedia.org/wiki/Page_fault) exception is generated. This is achieved by the fact that the `kmemcheck` marks memory pages as `non-present` (you can read more about this in the special part which is devoted to [paging](https://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)). If a `page fault` exception occurs, the exception handler knows about it and, in the case when `kmemcheck` is enabled, transfers control to it. After `kmemcheck` finishes its checks, the page is marked as `present` and the interrupted code is able to continue execution. There is a little subtlety in this chain: after the first instruction of the interrupted code is executed, `kmemcheck` marks the page as `non-present` again. In this way the next access to this memory will be caught too.

We just considered the `kmemcheck` mechanism from the theoretical side. Now let's consider how it is implemented in the Linux kernel.

Implementation of the `kmemcheck` mechanism in the Linux kernel
--------------------------------------------------------------------------------

So, now we know what `kmemcheck` is and what it does in the Linux kernel. Time to look at its implementation. The implementation of `kmemcheck` is split in two parts. The first, generic part is located in the [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c) source code file, and the second, [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture-specific part is located in the [arch/x86/mm/kmemcheck](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck) directory.

Let's start with the initialization of this mechanism. We already know that to enable the `kmemcheck` mechanism in the Linux kernel, we must enable the `CONFIG_KMEMCHECK` kernel configuration option. But besides this, we need to pass one of the following parameters:

* kmemcheck=0 (disabled)
* kmemcheck=1 (enabled)
* kmemcheck=2 (one-shot mode)

to the Linux kernel command line. The first two are clear, but the last needs a little explanation. This option switches `kmemcheck` into a special mode in which it will be turned off after detecting the first use of uninitialized memory. Actually this mode is enabled by default in the Linux kernel:

![kernel configuration menu](http://oi66.tinypic.com/y2eeh.jpg)

We know from the seventh [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html) of the chapter which describes the initialization of the Linux kernel that the kernel command line is parsed during initialization in the `do_initcall_level` and `do_early_param` functions. Actually the `kmemcheck` subsystem consists of two stages. The first stage is early. If we look at the [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c) source code file, we will see the `param_kmemcheck` function which will be called during early command line parsing:

```C
static int __init param_kmemcheck(char *str)
{
	int val;
	int ret;

	if (!str)
		return -EINVAL;

	ret = kstrtoint(str, 0, &val);
	if (ret)
		return ret;
	kmemcheck_enabled = val;
	return 0;
}

early_param("kmemcheck", param_kmemcheck);
```

As we already saw, the `param_kmemcheck` may get one of the following values: `0` (disabled), `1` (enabled) or `2` (one-shot mode). The implementation of the `param_kmemcheck` is pretty simple. We just convert the string value of the `kmemcheck` command line option to an integer representation and set it in the `kmemcheck_enabled` variable.

The second stage will be executed during the initialization of the Linux kernel, or more precisely during the initialization of the early [initcalls](https://0xax.gitbooks.io/linux-insides/content/Concepts/initcall.html). The second stage is represented by the `kmemcheck_init` function:

```C
int __init kmemcheck_init(void)
{
	...
	...
	...
}

early_initcall(kmemcheck_init);
```

The main goal of the `kmemcheck_init` function is to call the `kmemcheck_selftest` function and check its result:

```C
if (!kmemcheck_selftest()) {
	printk(KERN_INFO "kmemcheck: self-tests failed; disabling\n");
	kmemcheck_enabled = 0;
	return -EINVAL;
}

printk(KERN_INFO "kmemcheck: Initialized\n");
```

It returns `-EINVAL` if this check fails. The `kmemcheck_selftest` function checks the sizes of different memory access related [opcodes](https://en.wikipedia.org/wiki/Opcode) like `rep movsb`, `movzwq` and so on. If the sizes of the opcodes are equal to the expected sizes, the `kmemcheck_selftest` returns `true`, otherwise `false`.

So when somebody calls:

```C
struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL);
```

through a series of different function calls the `kmem_getpages` function will be called. This function is defined in the [mm/slab.c](https://github.com/torvalds/linux/blob/master/mm/slab.c) source code file and its main goal is to try to allocate [pages](https://en.wikipedia.org/wiki/Paging) with the given flags. At the end of this function we can see the following code:

```C
if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
	kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid);

	if (cachep->ctor)
		kmemcheck_mark_uninitialized_pages(page, nr_pages);
	else
		kmemcheck_mark_unallocated_pages(page, nr_pages);
}
```

So, here we check that if `kmemcheck` is enabled and the `SLAB_NOTRACK` bit is not set in the flags, we set the `non-present` bit for the just allocated pages. The `SLAB_NOTRACK` bit tells us not to track uninitialized memory for this cache. Additionally, if a cache object has a constructor (details will be considered in the next parts) we mark the allocated pages as uninitialized, otherwise as unallocated. The `kmemcheck_alloc_shadow` function is defined in the [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c) source code file and does the following things:

```C
void kmemcheck_alloc_shadow(struct page *page, int order, gfp_t flags, int node)
{
	struct page *shadow;

	shadow = alloc_pages_node(node, flags | __GFP_NOTRACK, order);

	for(i = 0; i < pages; ++i)
		page[i].shadow = page_address(&shadow[i]);

	kmemcheck_hide_pages(page, pages);
}
```

First of all it allocates memory space for the shadow bits. If a shadow bit is set for a page, this means that the page is tracked by `kmemcheck`. After we have allocated space for the shadow bits, we store a pointer to the shadow in each allocated page. In the end we just call the `kmemcheck_hide_pages` function with the pointer to the allocated pages and the number of these pages. The `kmemcheck_hide_pages` is an architecture-specific function, so its implementation is located in the [arch/x86/mm/kmemcheck/kmemcheck.c](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck/kmemcheck.c) source code file. The main goal of this function is to set the `non-present` bit in the given pages. Let's look at the implementation of this function:

```C
void kmemcheck_hide_pages(struct page *p, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; ++i) {
		unsigned long address;
		pte_t *pte;
		unsigned int level;

		address = (unsigned long) page_address(&p[i]);
		pte = lookup_address(address, &level);
		BUG_ON(!pte);
		BUG_ON(level != PG_LEVEL_4K);

		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
		set_pte(pte, __pte(pte_val(*pte) | _PAGE_HIDDEN));
		__flush_tlb_one(address);
	}
}
```

Here we go through all the pages and try to get the `page table entry` for each page. If this operation is successful, we unset the present bit and set the hidden bit in each page. In the end we flush the [translation lookaside buffer](https://en.wikipedia.org/wiki/Translation_lookaside_buffer), because some pages were changed. From this point the allocated pages are tracked by `kmemcheck`. Now, as the `present` bit is unset, a [page fault](https://en.wikipedia.org/wiki/Page_fault) exception will occur right after `kmalloc` returns a pointer to the allocated space and some code tries to access this memory.

As you may remember from the [second part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) of the Linux kernel initialization chapter, the `page fault` handler is located in the [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c) source code file and is represented by the `do_page_fault` function. We can see the following check at the beginning of the `__do_page_fault` function:

```C
static noinline void
__do_page_fault(struct pt_regs *regs, unsigned long error_code,
		unsigned long address)
{
	...
	...
	...
	if (kmemcheck_active(regs))
		kmemcheck_hide(regs);
	...
	...
	...
}
```

The `kmemcheck_active` gets the `kmemcheck_context` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) structure and returns the result of comparing the `balance` field of this structure with zero:

```C
bool kmemcheck_active(struct pt_regs *regs)
{
	struct kmemcheck_context *data = this_cpu_ptr(&kmemcheck_context);

	return data->balance > 0;
}
```

The `kmemcheck_context` is a structure which describes the current state of the `kmemcheck` mechanism. It stores the uninitialized addresses, the number of such addresses and so on. The `balance` field of this structure represents the current state of `kmemcheck`, or in other words it tells us whether `kmemcheck` has already shown hidden pages or not. If `data->balance` is greater than zero, the `kmemcheck_hide` function will be called. This means that `kmemcheck` has already set the `present` bit for the given pages, and now we need to hide the pages again, by unsetting the `present` bit, to cause the next page fault. In other words, one session of `kmemcheck` has already finished and a new page fault has occurred. At the very first page fault the `kmemcheck_active` returns false, as `data->balance` is zero at the start, and the `kmemcheck_hide` will not be called. Next, we see the following line of code in the `__do_page_fault`:

```C
if (kmemcheck_fault(regs, address, error_code))
	return;
```

First of all the `kmemcheck_fault` function checks that the fault occurred for the correct reason. At first we check the [flags register](https://en.wikipedia.org/wiki/FLAGS_register) and verify that we are in normal kernel mode:

```C
if (regs->flags & X86_VM_MASK)
	return false;
if (regs->cs != __KERNEL_CS)
	return false;
```

If these checks are not successful, we return from the `kmemcheck_fault` function, as it was not a `kmemcheck` related page fault. After this we try to look up the `page table entry` related to the faulted address, and if we can't find it we return:

```C
pte = kmemcheck_pte_lookup(address);
if (!pte)
	return false;
```

The last two steps of the `kmemcheck_fault` function are to call the `kmemcheck_access` function, which checks access to the given page, and to show the page again by setting the present bit in it. The `kmemcheck_access` function does all the main job. It checks the current instruction which caused the page fault. If it finds an error, the context of this error will be saved by `kmemcheck` to the ring queue:

```C
static struct kmemcheck_error error_fifo[CONFIG_KMEMCHECK_QUEUE_SIZE];
```

The `kmemcheck` mechanism declares a special [tasklet](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html):

```C
static DECLARE_TASKLET(kmemcheck_tasklet, &do_wakeup, 0);
```

which runs the `do_wakeup` function from the [arch/x86/mm/kmemcheck/error.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/kmemcheck/error.c) source code file when it is scheduled to run.

The `do_wakeup` function will call the `kmemcheck_error_recall` function which will print the errors collected by `kmemcheck`. As we already saw, the:

```C
kmemcheck_show(regs);
```

function will be called at the end of the `kmemcheck_fault` function. This function will set the present bit for the given pages again:

```C
if (unlikely(data->balance != 0)) {
	kmemcheck_show_all();
	kmemcheck_error_save_bug(regs);
	data->balance = 0;
	return;
}
```

where the `kmemcheck_show_all` function calls the `kmemcheck_show_addr` for each address:

```C
static unsigned int kmemcheck_show_all(void)
{
	struct kmemcheck_context *data = this_cpu_ptr(&kmemcheck_context);
	unsigned int i;
	unsigned int n;

	n = 0;
	for (i = 0; i < data->n_addrs; ++i)
		n += kmemcheck_show_addr(data->addr[i]);

	return n;
}
```

The `kmemcheck_show_addr` itself looks up the page table entry for the given address and sets the `present` bit:

```C
int kmemcheck_show_addr(unsigned long address)
{
	pte_t *pte;

	pte = kmemcheck_pte_lookup(address);
	if (!pte)
		return 0;

	set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
	__flush_tlb_one(address);
	return 1;
}
```

At the end of the `kmemcheck_show` function we set the [TF](https://en.wikipedia.org/wiki/Trap_flag) flag if it wasn't set:

```C
if (!(regs->flags & X86_EFLAGS_TF))
	data->flags = regs->flags;
```

We need to do this because we need to hide the pages again right after the first instruction following the page fault has been executed. With the `TF` flag set, the processor switches into single-step mode after the first instruction is executed, and a `debug` exception occurs. From this moment the pages are hidden again and execution continues. As the pages are hidden, the next access will raise a page fault again, and `kmemcheck` continues to check/collect errors and print them from time to time.

That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the third part about Linux kernel [memory management](https://en.wikipedia.org/wiki/Memory_management). If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new). In the next part we will see yet another memory debugging related tool - `kmemleak`.

**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [memory management](https://en.wikipedia.org/wiki/Memory_management)
* [debugging](https://en.wikipedia.org/wiki/Debugging)
* [memory leaks](https://en.wikipedia.org/wiki/Memory_leak)
* [kmemcheck documentation](https://www.kernel.org/doc/Documentation/kmemcheck.txt)
* [valgrind](https://en.wikipedia.org/wiki/Valgrind)
* [paging](https://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)
* [page fault](https://en.wikipedia.org/wiki/Page_fault)
* [initcalls](https://0xax.gitbooks.io/linux-insides/content/Concepts/initcall.html)
* [opcode](https://en.wikipedia.org/wiki/Opcode)
* [translation lookaside buffer](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)
* [per-cpu variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [flags register](https://en.wikipedia.org/wiki/FLAGS_register)
* [tasklet](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)