commit
da38b6038d
@ -0,0 +1,473 @@
|
||||
Kernel initialization. Part 10.
|
||||
================================================================================
|
||||
|
||||
End of the linux kernel initialization process
|
||||
================================================================================
|
||||
|
||||
This is tenth part of the chapter about linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and in the [previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html) we saw the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update) and stopped on the call of the `acpi_early_init` function. This part will be the last part of the [Kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) chapter, so let's finish with it.
|
||||
|
||||
After the call of the `acpi_early_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c), we can see the following code:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_X86_ESPFIX64
|
||||
init_espfix_bsp();
|
||||
#endif
|
||||
```
|
||||
|
||||
Here we can see the call of the `init_espfix_bsp` function which depends on the `CONFIG_X86_ESPFIX64` kernel configuration option. As we can understand from the function name, it does something with the stack. This function defined in the [arch/x86/kernel/espfix_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/espfix_64.c) and prevents leaking of `31:16` bits of the `esp` register during returning to 16-bit stack. First of all we install `espfix` page upper directory into the kernel page directory in the `init_espfix_bs`:
|
||||
|
||||
```C
|
||||
pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
|
||||
pgd_populate(&init_mm, pgd_p, (pud_t *)espfix_pud_page);
|
||||
```
|
||||
|
||||
Where `ESPFIX_BASE_ADDR` is:
|
||||
|
||||
```C
|
||||
#define PGDIR_SHIFT 39
|
||||
#define ESPFIX_PGD_ENTRY _AC(-2, UL)
|
||||
#define ESPFIX_BASE_ADDR (ESPFIX_PGD_ENTRY << PGDIR_SHIFT)
|
||||
```
|
||||
|
||||
Also we can find it in the [Documentation/arch/x86_64/mm](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt):
|
||||
|
||||
```
|
||||
... unused hole ...
|
||||
ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
|
||||
... unused hole ...
|
||||
```
|
||||
|
||||
After we've filled page global directory with the `espfix` pud, the next step is call of the `init_espfix_random` and `init_espfix_ap` functions. The first function returns random locations for the `espfix` page and the second enables the `espfix` the current CPU. After the `init_espfix_bsp` finished to work, we can see the call of the `thread_info_cache_init` function which defined in the [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c) and allocates cache for the `thread_info` if its size is less than `PAGE_SIZE`:
|
||||
|
||||
```C
|
||||
# if THREAD_SIZE >= PAGE_SIZE
|
||||
...
|
||||
...
|
||||
...
|
||||
void thread_info_cache_init(void)
|
||||
{
|
||||
thread_info_cache = kmem_cache_create("thread_info", THREAD_SIZE,
|
||||
THREAD_SIZE, 0, NULL);
|
||||
BUG_ON(thread_info_cache == NULL);
|
||||
}
|
||||
...
|
||||
...
|
||||
...
|
||||
#endif
|
||||
```
|
||||
|
||||
As we already know the `PAGE_SIZE` is `(_AC(1,UL) << PAGE_SHIFT)` or `4096` bytes and `THREAD_SIZE` is `(PAGE_SIZE << THREAD_SIZE_ORDER)` or `16384` bytes for the `x86_64`. The next function after the `thread_info_cache_init` is the `cred_init` from the [kernel/cred.c](https://github.com/torvalds/linux/blob/master/kernel/cred.c). This function just allocates space for the credentials (like `uid`, `gid` and etc...):
|
||||
|
||||
```C
|
||||
void __init cred_init(void)
|
||||
{
|
||||
cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred),
|
||||
0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
|
||||
}
|
||||
```
|
||||
|
||||
more about credentials you can read in the [Documentation/security/credentials.txt](https://github.com/torvalds/linux/blob/master/Documentation/security/credentials.txt). Next step is the `fork_init` function from the [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c). The `fork_init` function allocates space for the `task_struct`. Let's look on the implementation of the `fork_init`. First of all we can see definitions of the `ARCH_MIN_TASKALIGN` macro and creation of a slab where task_structs will be allocated:
|
||||
|
||||
```C
|
||||
#ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
|
||||
#ifndef ARCH_MIN_TASKALIGN
|
||||
#define ARCH_MIN_TASKALIGN L1_CACHE_BYTES
|
||||
#endif
|
||||
task_struct_cachep =
|
||||
kmem_cache_create("task_struct", sizeof(struct task_struct),
|
||||
ARCH_MIN_TASKALIGN, SLAB_PANIC | SLAB_NOTRACK, NULL);
|
||||
#endif
|
||||
```
|
||||
|
||||
As we can see this code depends on the `CONFIG_ARCH_TASK_STRUCT_ACLLOCATOR` kernel configuration option. This configuration option shows the presence of the `alloc_task_struct` for the given architecture. As `x86_64` has no `alloc_task_struct` function, this code will not work and even will not be compiled on the `x86_64`.
|
||||
|
||||
Allocating cache for init task
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
After this we can see the call of the `arch_task_cache_init` function in the `fork_init`:
|
||||
|
||||
```C
|
||||
void arch_task_cache_init(void)
|
||||
{
|
||||
task_xstate_cachep =
|
||||
kmem_cache_create("task_xstate", xstate_size,
|
||||
__alignof__(union thread_xstate),
|
||||
SLAB_PANIC | SLAB_NOTRACK, NULL);
|
||||
setup_xstate_comp();
|
||||
}
|
||||
```
|
||||
|
||||
The `arch_task_cache_init` does initialization of the architecture-specific caches. In our case it is `x86_64`, so as we can see, the `arch_task_cache_init` allocates space for the `task_xstate` which represents [FPU](http://en.wikipedia.org/wiki/Floating-point_unit) state and sets up offsets and sizes of all extended states in [xsave](http://www.felixcloutier.com/x86/XSAVES.html) area with the call of the `setup_xstate_comp` function. After the `arch_task_cache_init` we calculate default maximum number of threads with the:
|
||||
|
||||
```C
|
||||
set_max_threads(MAX_THREADS);
|
||||
```
|
||||
|
||||
where default maximum number of threads is:
|
||||
|
||||
```C
|
||||
#define FUTEX_TID_MASK 0x3fffffff
|
||||
#define MAX_THREADS FUTEX_TID_MASK
|
||||
```
|
||||
|
||||
In the end of the `fork_init` function we initalize [signal](http://www.win.tue.nl/~aeb/linux/lk/lk-5.html) handler:
|
||||
|
||||
```C
|
||||
init_task.signal->rlim[RLIMIT_NPROC].rlim_cur = max_threads/2;
|
||||
init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
|
||||
init_task.signal->rlim[RLIMIT_SIGPENDING] =
|
||||
init_task.signal->rlim[RLIMIT_NPROC];
|
||||
```
|
||||
|
||||
As we know the `init_task` is an instance of the `task_struct` structure, so it contains `signal` field which represents signal handler. It has following type `struct signal_struct`. On the first two lines we can see setting of the current and maximum limit of the `resource limits`. Every process has an associated set of resource limits. These limits specify amount of resources which current process can use. Here `rlim` is resource control limit and presented by the:
|
||||
|
||||
```C
|
||||
struct rlimit {
|
||||
__kernel_ulong_t rlim_cur;
|
||||
__kernel_ulong_t rlim_max;
|
||||
};
|
||||
```
|
||||
|
||||
structure from the [include/uapi/linux/resource.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/resource.h). In our case the resource is the `RLIMIT_NPROC` which is the maximum number of process that use can own and `RLIMIT_SIGPENDING` - the maximum number of pending signals. We can see it in the:
|
||||
|
||||
```C
|
||||
cat /proc/self/limits
|
||||
Limit Soft Limit Hard Limit Units
|
||||
...
|
||||
...
|
||||
...
|
||||
Max processes 63815 63815 processes
|
||||
Max pending signals 63815 63815 signals
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
Initialization of the caches
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The next function after the `fork_init` is the `proc_caches_init` from the [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c). This function allocates caches for the memory descriptors (or `mm_struct` structure). At the beginning of the `proc_caches_init` we can see allocation of the different [SLAB](http://en.wikipedia.org/wiki/Slab_allocation) caches with the call of the `kmem_cache_create`:
|
||||
|
||||
* `sighand_cachep` - manage information about installed signal handlers;
|
||||
* `signal_cachep` - manage information about process signal descriptor;
|
||||
* `files_cachep` - manage information about opened files;
|
||||
* `fs_cachep` - manage filesystem information.
|
||||
|
||||
After this we allocate `SLAB` cache for the `mm_struct` structures:
|
||||
|
||||
```C
|
||||
mm_cachep = kmem_cache_create("mm_struct",
|
||||
sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
|
||||
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL);
|
||||
```
|
||||
|
||||
|
||||
After this we allocate `SLAB` cache for the important `vm_area_struct` which used by the kernel to manage virtual memory space:
|
||||
|
||||
```C
|
||||
vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC);
|
||||
```
|
||||
|
||||
Note, that we use `KMEM_CACHE` macro here instead of the `kmem_cache_create`. This macro defined in the [include/linux/slab.h](https://github.com/torvalds/linux/blob/master/include/linux/slab.h) and just expands to the `kmem_cache_create` call:
|
||||
|
||||
```C
|
||||
#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
|
||||
sizeof(struct __struct), __alignof__(struct __struct),\
|
||||
(__flags), NULL)
|
||||
```
|
||||
|
||||
The `KMEM_CACHE` has one difference from `kmem_cache_create`. Take a look on `__alignof__` operator. The `KMEM_CACHE` macro aligns `SLAB` to the size of the given structure, but `kmem_cache_create` uses given value to align space. After this we can see the call of the `mmap_init` and `nsproxy_cache_init` functions. The first function initalizes virtual memory area `SLAB` and the second function initializes `SLAB` for namespaces.
|
||||
|
||||
The next function after the `proc_caches_init` is `buffer_init`. This function defined in the [fs/buffer.c](https://github.com/torvalds/linux/blob/master/fs/buffer.c) source code file and allocate cache for the `buffer_head`. The `buffer_head` is a special structure which defined in the [include/linux/buffer_head.h](https://github.com/torvalds/linux/blob/master/include/linux/buffer_head.h) and used for managing buffers. In the start of the `bufer_init` function we allocate cache for the `struct buffer_head` structures with the call of the `kmem_cache_create` function as we did it in the previous functions. And calcuate the maximum size of the buffers in memory with:
|
||||
|
||||
```C
|
||||
nrpages = (nr_free_buffer_pages() * 10) / 100;
|
||||
max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
|
||||
```
|
||||
|
||||
which will be equal to the `10%` of the `ZONE_NORMAL` (all RAM from the 4GB on the `x86_64`). The next function after the `buffer_init` is - `vfs_caches_init`. This function allocates `SLAB` caches and hashtable for different [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) caches. We already saw the `vfs_caches_init_early` function in the eighth part of the linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html) which initialized caches for `dcache` (or directory-cache) and [inode](http://en.wikipedia.org/wiki/Inode) cache. The `vfs_caches_init` function makes post-early initialization of the `dcache` and `inode` caches, private data cache, hash tables for the mount points and etc... More details about [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) will be described in the separate part. After this we can see `signals_init` function. This function defined in the [kernel/signal.c](https://github.com/torvalds/linux/blob/master/kernel/signal.c) and allocates a cache for the `sigqueue` structures which represents queue of the real time signals. The next function is `page_writeback_init`. This function initializes the ratio for the dirty pages. Every low-level page entry contains the `dirty` bit which indicates whether a page has been written to when set.
|
||||
|
||||
Creation of the root for the procfs
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
After all of this preparations we need to create the root for the [proc](http://en.wikipedia.org/wiki/Procfs) filesystem. We will do it with the call of the `proc_root_init` function from the [fs/proc/root.c](https://github.com/torvalds/linux/blob/master/fs/proc/root.c). At the start of the `proc_root_init` function we allocate the cache for the inodes and register a new filesystem in the system with the:
|
||||
|
||||
```C
|
||||
err = register_filesystem(&proc_fs_type);
|
||||
if (err)
|
||||
return;
|
||||
```
|
||||
|
||||
As I wrote above we will not dive into details about [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) and different filesystems in this chapter, but will see it in the chapter about the `VFS`. After we've registered a new filesystem in the our system, we call the `proc_self_init` function from the TO[fs/proc/self.c](https://github.com/torvalds/linux/blob/master/fs/proc/self.c) and this function allocates `inode` number for the `self` (`/proc/self` directory refers to the process accessing the `/proc` filesystem). The next step after the `proc_self_init` is `proc_setup_thread_self` which setups the `/proc/thread-self` directory which contains information about current thread. After this we create `/proc/self/mounts` symllink which will contains mount points with the call of the
|
||||
|
||||
```C
|
||||
proc_symlink("mounts", NULL, "self/mounts");
|
||||
```
|
||||
|
||||
and a couple of directories depends on the different configuration options:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_SYSVIPC
|
||||
proc_mkdir("sysvipc", NULL);
|
||||
#endif
|
||||
proc_mkdir("fs", NULL);
|
||||
proc_mkdir("driver", NULL);
|
||||
proc_mkdir("fs/nfsd", NULL);
|
||||
#if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
|
||||
proc_mkdir("openprom", NULL);
|
||||
#endif
|
||||
proc_mkdir("bus", NULL);
|
||||
...
|
||||
...
|
||||
...
|
||||
if (!proc_mkdir("tty", NULL))
|
||||
return;
|
||||
proc_mkdir("tty/ldisc", NULL);
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
In the end of the `proc_root_init` we call the `proc_sys_init` function which creates `/proc/sys` directory and initializes the [Sysctl](http://en.wikipedia.org/wiki/Sysctl).
|
||||
|
||||
It is the end of `start_kernel` function. I did not describe all functions which are called in the `start_kernel`. I missed it, because they are not so improtant for the generic kernel initialization stuff and depend on only different kernel configurations. They are `taskstats_init_early` which exports per-task statistic to the user-space, `delayacct_init` - initializes per-task delay accounting, `key_init` and `security_init` initialize diferent security stuff, `check_bugs` - makes fix up of the some architecture-dependent bugs, `ftrace_init` function executes initialization of the [ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt), `cgroup_init` makes initialization of the rest of the [cgroup](http://en.wikipedia.org/wiki/Cgroups) subsystem and etc... Many of these parts and subsystems will be described in the other chapters.
|
||||
|
||||
That's all. Finally we passed through the long-long `start_kernel` function. But it is not the end of the linux kernel initialization process. We have no runned first process yet. In the end of the `start_kernel` we can see the last call of the - `rest_init` function. Let's go ahead.
|
||||
|
||||
First steps after the start_kernel
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The `rest_init` function defined in the same source code file as `start_kernel` function, and this file is [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). In the beginning of the `rest_init` we can see call of the two following functions:
|
||||
|
||||
```C
|
||||
rcu_scheduler_starting();
|
||||
smpboot_thread_init();
|
||||
```
|
||||
|
||||
The first `rcu_scheduler_starting` makes [RCU](http://en.wikipedia.org/wiki/Read-copy-update) scheduler active and the second `smpboot_thread_init` registers the `smpboot_thread_notifier` CPU notifier (more about it you can read in the [CPU hotplug documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt). After this we can see the following calls:
|
||||
|
||||
```C
|
||||
kernel_thread(kernel_init, NULL, CLONE_FS);
|
||||
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
|
||||
```
|
||||
|
||||
Here the `kernel_thread` function (defined in the [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c)) creates new kernel thread.As we can see the `kernel_thread` function takes three arguments:
|
||||
|
||||
* Function which will be executed in a new thread;
|
||||
* Parameter for the `kernel_init` function;
|
||||
* Flags.
|
||||
|
||||
We will not dive into details about `kernel_thread` implementation (we will see it in the chapter which will describe scheduler, just need to say that `kernel_thread` invokes [clone](http://www.tutorialspoint.com/unix_system_calls/clone.htm)). Now we only need to know that we create new kernel thread with `kernel_thread` function, parent and child of the thread will use shared information about a filesystem and it will start to execute `kernel_init` function. A kernel thread differs from an user thread that it runs in a kernel mode. So with these two `kernel_thread` calls we create two new kernel threads with the `PID = 1` for `init` process and `PID = 2` for `kthread`. We already know what is `init` process. Let's look on the `kthread`. It is special kernel thread which allows to `init` and different parts of the kernel to create another kernel threads. We can see it in the output of the `ps` util:
|
||||
|
||||
```C
|
||||
$ ps -ef | grep kthradd
|
||||
alex 12866 4767 0 18:26 pts/0 00:00:00 grep kthradd
|
||||
```
|
||||
|
||||
Let's postpone `kernel_init` and `kthreadd` for now and will go ahead in the `rest_init`. In the next step after we have created two new kernel threads we can see the following code:
|
||||
|
||||
```C
|
||||
rcu_read_lock();
|
||||
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
|
||||
rcu_read_unlock();
|
||||
```
|
||||
|
||||
The first `rcu_read_lock` function marks the beginning of an [RCU](http://en.wikipedia.org/wiki/Read-copy-update) read-side critical section and the `rcu_read_unlock` marks the end of an RCU read-side critical section. We call these functions because we need to protect the `find_task_by_pid_ns`. The `find_task_by_pid_ns` returns pointer to the `task_struct` by the given pid. So, here we are getting the pointer to the `task_struct` for the `PID = 2` (we got it after `kthreadd` creation with the `kernel_thread`). In the next step we call `complete` function
|
||||
|
||||
```C
|
||||
complete(&kthreadd_done);
|
||||
```
|
||||
|
||||
and pass address of the `kthreadd_done`. The `kthreadd_done` defined as
|
||||
|
||||
```C
|
||||
static __initdata DECLARE_COMPLETION(kthreadd_done);
|
||||
```
|
||||
|
||||
where `DECLARE_COMPLETION` macro defined as:
|
||||
|
||||
```C
|
||||
#define DECLARE_COMPLETION(work) \
|
||||
struct completion work = COMPLETION_INITIALIZER(work)
|
||||
```
|
||||
|
||||
and expands to the definition of the `completion` structure. This structure defined in the [include/linux/completion.h](https://github.com/torvalds/linux/blob/master/include/linux/completion.h) and presents `completions` concept. Completions are a code synchronization mechanism which is provide race-free solution for the threads that must wait for some process to have reached a point or a specific state. Using completions consists of three parts: The first is definition of the `complete` structure and we did it with the `DECLARE_COMPLETION`. The second is call of the `wait_for_completion`. After the call of this function, a thread which called it will not continue to execute and will wait while other thread did not call `complete` function. Note that we call `wait_for_completion` with the `kthreadd_done` in the beginning of the `kernel_init_freeable`:
|
||||
|
||||
```C
|
||||
wait_for_completion(&kthreadd_done);
|
||||
```
|
||||
|
||||
And the last step is to call `complete` function as we saw it above. After this the `kernel_init_freeable` function will not be executed while `kthreadd` thread will not be set. After the `kthreadd` was set, we can see three following functions in the `rest_init`:
|
||||
|
||||
```C
|
||||
init_idle_bootup_task(current);
|
||||
schedule_preempt_disabled();
|
||||
cpu_startup_entry(CPUHP_ONLINE);
|
||||
```
|
||||
|
||||
The first `init_idle_bootup_task` function from the [kernel/sched/core.c](https://github.com/torvalds/linux/blob/master/kernel/sched/core.c) sets the Scheduling class for the current process (`idle` class in our case):
|
||||
|
||||
```C
|
||||
void init_idle_bootup_task(struct task_struct *idle)
|
||||
{
|
||||
idle->sched_class = &idle_sched_class;
|
||||
}
|
||||
```
|
||||
|
||||
where `idle` class is a low priority tasks and tasks can be runned only when the processor has not to run anything besides this tasks. The second function `schedule_preempt_disabled` disables preempt in `idle` tasks. And the third function `cpu_startup_entry` defined in the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/master/sched/idle.c) and calls `cpu_idle_loop` from the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/master/sched/idle.c). The `cpu_idle_loop` function works as process with `PID = 0` and works in the background. Main purpose of the `cpu_idle_loop` is usage of the idle CPU cycles. When there are no one process to run, this process starts to work. We have one process with `idle` scheduling class (we just set the `current` task to the `idle` with the call of the `init_idle_bootup_task` function), so the `idle` thread does not do useful work and checks that there is not active task to switch:
|
||||
|
||||
```C
|
||||
static void cpu_idle_loop(void)
|
||||
{
|
||||
...
|
||||
...
|
||||
...
|
||||
while (1) {
|
||||
while (!need_resched()) {
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
More about it will be in the chapter about scheduler. So for this moment the `start_kernel` calls the `rest_init` function which spawns an `init` (`kernel_init` function) process and become `idle` process itself. Now is time to look on the `kernel_init`. Execution of the `kernel_init` function starts from the call of the `kernel_init_freeable` function. The `kernel_init_freeable` function first of all waits for the completion of the `kthreadd` setup. I already wrote about it above:
|
||||
|
||||
```C
|
||||
wait_for_completion(&kthreadd_done);
|
||||
```
|
||||
|
||||
After this we set `gfp_allowed_mask` to `__GFP_BITS_MASK` which means that already system is running, set allowed [cpus/mems](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt) to all CPUs and [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) nodes with the `set_mems_allowed` function, allow `init` process to run on any CPU with the `set_cpus_allowed_ptr`, set pid for the `cad` or `Ctrl-Alt-Delete`, do preparation for booting of the other CPUs with the call of the `smp_prepare_cpus`, call early [initcalls](http://kernelnewbies.org/Documents/InitcallMechanism) with the `do_pre_smp_initcalls`, initialization of the `SMP` with the `smp_init` and initialization of the [lockup_detector](https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt) with the call of the `lockup_detector_init` and initialize scheduler with the `sched_init_smp`.
|
||||
|
||||
After this we can see the call of the following functions - `do_basic_setup`. Before we will call the `do_basic_setup` function, our kernel already initialized for this moment. As comment says:
|
||||
|
||||
```
|
||||
Now we can finally start doing some real work..
|
||||
```
|
||||
|
||||
The `do_basic_setup` will reinitialize [cpuset](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt) to the active CPUs, initialization of the `khelper` - which is a kernel thread which used for making calls out to userspace from within the kernel, initialize [tmpfs](http://en.wikipedia.org/wiki/Tmpfs), initialize `drivers` subsystem, enable the user-mode helper `workqueue` and make post-early call of the `initcalls`. We can see openinng of the `dev/console` and dup twice file descriptors from `0` to `2` after the `do_basic_setup`:
|
||||
|
||||
|
||||
```C
|
||||
if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
|
||||
pr_err("Warning: unable to open an initial console.\n");
|
||||
|
||||
(void) sys_dup(0);
|
||||
(void) sys_dup(0);
|
||||
```
|
||||
|
||||
We are using two system calls here `sys_open` and `sys_dup`. In the next chapters we will see explanation and implementation of the different system calls. After we opened initial console, we check that `rdinit=` option was passed to the kernel command line or set default path of the ramdisk:
|
||||
|
||||
```C
|
||||
if (!ramdisk_execute_command)
|
||||
ramdisk_execute_command = "/init";
|
||||
```
|
||||
|
||||
Check user's permissions for the `ramdisk` and call the `prepare_namespace` function from the [init/do_mounts.c](https://github.com/torvalds/linux/blob/master/init/do_mounts.c) which checks and mounts the [initrd](http://en.wikipedia.org/wiki/Initrd):
|
||||
|
||||
```C
|
||||
if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
|
||||
ramdisk_execute_command = NULL;
|
||||
prepare_namespace();
|
||||
}
|
||||
```
|
||||
|
||||
This is the end of the `kernel_init_freeable` function and we need return to the `kernel_init`. The next step after the `kernel_init_freeable` finished its execution is the `async_synchronize_full`. This function waits until all asynchronous function calls have been done and after it we will call the `free_initmem` which will release all memory occupied by the initialization stuff which located between `__init_begin` and `__init_end`. After this we protect `.rodata` with the `mark_rodata_ro` and update state of the system from the `SYSTEM_BOOTING` to the
|
||||
|
||||
```C
|
||||
system_state = SYSTEM_RUNNING;
|
||||
```
|
||||
|
||||
And tries to run the `init` process:
|
||||
|
||||
```C
|
||||
if (ramdisk_execute_command) {
|
||||
ret = run_init_process(ramdisk_execute_command);
|
||||
if (!ret)
|
||||
return 0;
|
||||
pr_err("Failed to execute %s (error %d)\n",
|
||||
ramdisk_execute_command, ret);
|
||||
}
|
||||
```
|
||||
|
||||
First of all it checks the `ramdisk_execute_command` which we set in the `kernel_init_freeable` function and it will be equal to the value of the `rdinit=` kernel command line parameters or `/init` by default. The `run_init_process` function fills the first element of the `argv_init` array:
|
||||
|
||||
```C
|
||||
static const char *argv_init[MAX_INIT_ARGS+2] = { "init", NULL, };
|
||||
```
|
||||
|
||||
which represents arguments of the `init` program and call `do_execve` function:
|
||||
|
||||
```C
|
||||
argv_init[0] = init_filename;
|
||||
return do_execve(getname_kernel(init_filename),
|
||||
(const char __user *const __user *)argv_init,
|
||||
(const char __user *const __user *)envp_init);
|
||||
```
|
||||
|
||||
The `do_execve` function defined in the [include/linux/sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h) and runs program with the given file name and arguments. If we did not pass `rdinit=` option to the kernel command line, kernel starts to check the `execute_command` which is equal to value of the `init=` kernel command line parameter:
|
||||
|
||||
```C
|
||||
if (execute_command) {
|
||||
ret = run_init_process(execute_command);
|
||||
if (!ret)
|
||||
return 0;
|
||||
panic("Requested init %s failed (error %d).",
|
||||
execute_command, ret);
|
||||
}
|
||||
```
|
||||
|
||||
If we did not pass `init=` kernel command line parameter too, kernel tries to run one of the following executable files:
|
||||
|
||||
```C
|
||||
if (!try_to_run_init_process("/sbin/init") ||
|
||||
!try_to_run_init_process("/etc/init") ||
|
||||
!try_to_run_init_process("/bin/init") ||
|
||||
!try_to_run_init_process("/bin/sh"))
|
||||
return 0;
|
||||
```
|
||||
|
||||
In other way we finish with [panic](http://en.wikipedia.org/wiki/Kernel_panic):
|
||||
|
||||
```C
|
||||
panic("No working init found. Try passing init= option to kernel. "
|
||||
"See Linux Documentation/init.txt for guidance.");
|
||||
```
|
||||
|
||||
That's all! Linux kernel initialization process is finished!
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
It is the end of the tenth part about the linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). And it is not only `tenth` part, but this is the last part which describes initialization of the linux kernel. As I wrote in the first [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of this chapter, we will go through all steps of the kernel initialization and we did it. We started at the first architecture-independent function - `start_kernel` and finished with the launch of the first `init` process in the our system. I missed details about different subsystem of the kernel, for example I almost did not cover linux kernel scheduler or we did not see almost anything about interrupts and exceptions handling and etc... From the next part we will start to dive to the different kernel subsystems. Hope it will be interesting.
|
||||
|
||||
If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
||||
|
||||
**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [SLAB](http://en.wikipedia.org/wiki/Slab_allocation)
|
||||
* [xsave](http://www.felixcloutier.com/x86/XSAVES.html)
|
||||
* [FPU](http://en.wikipedia.org/wiki/Floating-point_unit)
|
||||
* [Documentation/security/credentials.txt](https://github.com/torvalds/linux/blob/master/Documentation/security/credentials.txt)
|
||||
* [Documentation/x86/x86_64/mm](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt)
|
||||
* [RCU](http://en.wikipedia.org/wiki/Read-copy-update)
|
||||
* [VFS](http://en.wikipedia.org/wiki/Virtual_file_system)
|
||||
* [inode](http://en.wikipedia.org/wiki/Inode)
|
||||
* [proc](http://en.wikipedia.org/wiki/Procfs)
|
||||
* [man proc](http://linux.die.net/man/5/proc)
|
||||
* [Sysctl](http://en.wikipedia.org/wiki/Sysctl)
|
||||
* [ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt)
|
||||
* [cgroup](http://en.wikipedia.org/wiki/Cgroups)
|
||||
* [CPU hotplug documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
|
||||
* [completions - wait for completion handling](https://www.kernel.org/doc/Documentation/scheduler/completion.txt)
|
||||
* [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access)
|
||||
* [cpus/mems](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt)
|
||||
* [initcalls](http://kernelnewbies.org/Documents/InitcallMechanism)
|
||||
* [Tmpfs](http://en.wikipedia.org/wiki/Tmpfs)
|
||||
* [initrd](http://en.wikipedia.org/wiki/Initrd)
|
||||
* [panic](http://en.wikipedia.org/wiki/Kernel_panic)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-9.html)
|
@ -0,0 +1,481 @@
|
||||
Kernel initialization. Part 8.
|
||||
================================================================================
|
||||
|
||||
Scheduler initialization
|
||||
================================================================================
|
||||
|
||||
This is the eighth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) of the Linux kernel initialization process and we stopped on the `setup_nr_cpu_ids` function in the [previous](https://github.com/0xAX/linux-insides/blob/master/Initialization/linux-initialization-7.md) part. The main point of the current part is [scheduler](http://en.wikipedia.org/wiki/Scheduling_%28computing%29) initialization. But before we will start to learn initialization process of the scheduler, we need to do some stuff. The next step in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) is the `setup_per_cpu_areas` function. This function setups areas for the `percpu` variables, more about it you can read in the special part about the [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html). After `percpu` areas up and running, the next step is the `smp_prepare_boot_cpu` function. This function does some preparations for the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing):
|
||||
|
||||
```C
|
||||
static inline void smp_prepare_boot_cpu(void)
|
||||
{
|
||||
smp_ops.smp_prepare_boot_cpu();
|
||||
}
|
||||
```
|
||||
|
||||
where the `smp_prepare_boot_cpu` expands to the call of the `native_smp_prepare_boot_cpu` function (more about `smp_ops` will be in the special parts about `SMP`):
|
||||
|
||||
```C
|
||||
void __init native_smp_prepare_boot_cpu(void)
|
||||
{
|
||||
int me = smp_processor_id();
|
||||
switch_to_new_gdt(me);
|
||||
cpumask_set_cpu(me, cpu_callout_mask);
|
||||
per_cpu(cpu_state, me) = CPU_ONLINE;
|
||||
}
|
||||
```
|
||||
|
||||
The `native_smp_prepare_boot_cpu` function gets the number of the current CPU (which is Bootstrap processor and its `id` is zero) with the `smp_processor_id` function. I will not explain how the `smp_processor_id` works, because we alread saw it in the [Kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) part. As we got processor `id` number we reload [Global Descriptor Table](http://en.wikipedia.org/wiki/Global_Descriptor_Table) for the given CPU with the `switch_to_new_gdt` function:
|
||||
|
||||
```C
|
||||
void switch_to_new_gdt(int cpu)
|
||||
{
|
||||
struct desc_ptr gdt_descr;
|
||||
|
||||
gdt_descr.address = (long)get_cpu_gdt_table(cpu);
|
||||
gdt_descr.size = GDT_SIZE - 1;
|
||||
load_gdt(&gdt_descr);
|
||||
load_percpu_segment(cpu);
|
||||
}
|
||||
```
|
||||
|
||||
The `gdt_descr` variable represents pointer to the `GDT` descriptor here (we already saw `desc_ptr` in the [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html)). We get the address and the size of the `GDT` descriptor where `GDT_SIZE` is `256` or:
|
||||
|
||||
```C
|
||||
#define GDT_SIZE (GDT_ENTRIES * 8)
|
||||
```
|
||||
|
||||
and the address of the descriptor we will get with the `get_cpu_gdt_table`:
|
||||
|
||||
```C
|
||||
static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu)
|
||||
{
|
||||
return per_cpu(gdt_page, cpu).gdt;
|
||||
}
|
||||
```
|
||||
|
||||
The `get_cpu_gdt_table` uses `per_cpu` macro for getting `gdt_page` percpu variable for the given CPU number (bootstrap processor with `id` - 0 in our case). You can ask the following question: so, if we can access `gdt_page` percpu variable, where it was defined? Actually we alread saw it in this book. If you have read the first [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) of this chapter, you can remember that we saw definition of the `gdt_page` in the [arch/x86/kernel/head_64.S](https://github.com/0xAX/linux/blob/master/arch/x86/kernel/head_64.S):
|
||||
|
||||
```assembly
|
||||
early_gdt_descr:
|
||||
.word GDT_ENTRIES*8-1
|
||||
early_gdt_descr_base:
|
||||
.quad INIT_PER_CPU_VAR(gdt_page)
|
||||
```
|
||||
|
||||
and if we will look on the [linker](https://github.com/0xAX/linux/blob/master/arch/x86/kernel/vmlinux.lds.S) file we can see that it locates after the `__per_cpu_load` symbol:
|
||||
|
||||
```C
|
||||
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
|
||||
INIT_PER_CPU(gdt_page);
|
||||
```
|
||||
|
||||
and filled `gdt_page` in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c#L94):
|
||||
|
||||
```C
|
||||
DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
|
||||
#ifdef CONFIG_X86_64
|
||||
[GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
|
||||
[GDT_ENTRY_KERNEL_CS] = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
|
||||
[GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
|
||||
[GDT_ENTRY_DEFAULT_USER32_CS] = GDT_ENTRY_INIT(0xc0fb, 0, 0xfffff),
|
||||
[GDT_ENTRY_DEFAULT_USER_DS] = GDT_ENTRY_INIT(0xc0f3, 0, 0xfffff),
|
||||
[GDT_ENTRY_DEFAULT_USER_CS] = GDT_ENTRY_INIT(0xa0fb, 0, 0xfffff),
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
more about `percpu` variables you can read in the [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) part. As we got address and size of the `GDT` descriptor we case reload `GDT` with the `load_gdt` which just execute `lgdt` instruct and load `percpu_segment` with the following function:
|
||||
|
||||
```C
|
||||
void load_percpu_segment(int cpu) {
|
||||
loadsegment(gs, 0);
|
||||
wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu));
|
||||
load_stack_canary_segment();
|
||||
}
|
||||
```
|
||||
|
||||
The base address of the `percpu` area must contain `gs` register (or `fs` register for `x86`), so we are using `loadsegment` macro and pass `gs`. In the next step we writes the base address if the [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) stack and setup stack [canary](http://en.wikipedia.org/wiki/Buffer_overflow_protection) (this is only for `x86_32`). After we load new `GDT`, we fill `cpu_callout_mask` bitmap with the current cpu and set cpu state as online with the setting `cpu_state` percpu variable for the current processor - `CPU_ONLINE`:
|
||||
|
||||
```C
|
||||
cpumask_set_cpu(me, cpu_callout_mask);
|
||||
per_cpu(cpu_state, me) = CPU_ONLINE;
|
||||
```
|
||||
|
||||
So, what is it `cpu_callout_mask` bitmap... As we initialized bootstrap processor (procesoor which is booted the first on `x86`) the other processors in a multiprocessor system are known as `secondary processors`. Linux kernel uses two following bitmasks:
|
||||
|
||||
* `cpu_callout_mask`
|
||||
* `cpu_callin_mask`
|
||||
|
||||
After bootstrap processor initialized, it updates the `cpu_callout_mask` to indicate which secondary processor can be initialized next. All other or secondary processors can do some initialization stuff before and check the `cpu_callout_mask` on the boostrap processor bit. Only after the bootstrap processor filled the `cpu_callout_mask` this secondary processor, it will continue the rest of its initialization. After that the certain processor will finish its initialization process, the processor sets bit in the `cpu_callin_mask`. Once the bootstrap processor finds the bit in the `cpu_callin_mask` for the current secondary processor, this processor repeats the same procedure for initialization of the rest of a secondary processors. In a short words it works as i described, but more details we will see in the chapter about `SMP`.
|
||||
|
||||
That's all. We did all `SMP` boot preparation.
|
||||
|
||||
Build zonelists
|
||||
-----------------------------------------------------------------------
|
||||
|
||||
In the next step we can see the call of the `build_all_zonelists` function. This function sets up the order of zones that allocations are preferred from. What are zones and what's order we will understand now. For the start let's see how linux kernel considers physical memory. Physical memory may be arranged into banks which are called - `nodes`. If you has no hardware with support for `NUMA`, you will see only one node:
|
||||
|
||||
```
|
||||
$ cat /sys/devices/system/node/node0/numastat
|
||||
numa_hit 72452442
|
||||
numa_miss 0
|
||||
numa_foreign 0
|
||||
interleave_hit 12925
|
||||
local_node 72452442
|
||||
other_node 0
|
||||
```
|
||||
|
||||
Every `node` presented by the `struct pglist data` in the linux kernel. Each node devided into a number of special blocks which are called - `zones`. Every zone presented by the `zone struct` in the linux kernel and has one of the type:
|
||||
|
||||
* `ZONE_DMA` - 0-16M;
|
||||
* `ZONE_DMA32` - used for 32 bit devices that can only do DMA areas below 4G;
|
||||
* `ZONE_NORMAL` - all RAM from the 4GB on the `x86_64`;
|
||||
* `ZONE_HIGHMEM` - absent on the `x86_64`;
|
||||
* `ZONE_MOVABLE` - zone which contains movable pages.
|
||||
|
||||
which are presented by the `zone_type` enum. Information about zones we can get with the:
|
||||
|
||||
```
|
||||
$ cat /proc/zoneinfo
|
||||
Node 0, zone DMA
|
||||
pages free 3975
|
||||
min 3
|
||||
low 3
|
||||
...
|
||||
...
|
||||
Node 0, zone DMA32
|
||||
pages free 694163
|
||||
min 875
|
||||
low 1093
|
||||
...
|
||||
...
|
||||
Node 0, zone Normal
|
||||
pages free 2529995
|
||||
min 3146
|
||||
low 3932
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
As I wrote above all nodes are described with the `pglist_data` or `pg_data_t` structure in memory. This structure defined in the [include/linux/mmzone.h](https://github.com/torvalds/linux/blob/master/include/linux/mmzone.h). The `build_all_zonelists` function from the [mm/page_alloc.c](https://github.com/torvalds/linux/blob/master/mm/page_alloc.c) constructs an ordered `zonelist` (of different zones `DMA`, `DMA32`, `NORMAL`, `HIGH_MEMORY`, `MOVABLE`) which specifies the zones/nodes to visit when a selected `zone` or `node` cannot satisfy the allocation request. That's all. More about `NUMA` and multiprocessor systems will be in the special part.
|
||||
|
||||
The rest of the stuff before scheduler initialization
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Before we will start to dive into linux kernel scheduler initialization process we must to do a couple of things. The fisrt thing is the `page_alloc_init` function from the [mm/page_alloc.c](https://github.com/torvalds/linux/blob/master/mm/page_alloc.c). This function looks pretty easy:
|
||||
|
||||
```C
|
||||
void __init page_alloc_init(void)
|
||||
{
|
||||
hotcpu_notifier(page_alloc_cpu_notify, 0);
|
||||
}
|
||||
```
|
||||
|
||||
and initializes handler for the `CPU` [hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt). Of course the `hotcpu_notifier` depends on the
|
||||
`CONFIG_HOTPLUG_CPU` configuration option and if this option is set, it just calls `cpu_notifier` macro which expands to the call of the `register_cpu_notifier` which adds hotplug cpu handler (`page_alloc_cpu_notify` in our case).
|
||||
|
||||
After this we can see the kernel command line in the initialization output:
|
||||
|
||||
![kernel command line](http://oi58.tinypic.com/2m7vz10.jpg)
|
||||
|
||||
And a couple of functions as `parse_early_param` and `parse_args` which are handles linux kernel command line. You can remember that we already saw the call of the `parse_early_param` function in the sixth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-6.html) of the kernel initialization chapter, so why we call it again? Answer is simple: we call this function in the architecture-specific code (`x86_64` in our case), but not all architecture calls this function. And we need in the call of the second function `parse_args` to parse and handle non-early command line arguments.
|
||||
|
||||
In the next step we can see the call of the `jump_label_init` from the [kernel/jump_label.c](https://github.com/torvalds/linux/blob/master/kernel/jump_label.c). and initializes [jump label](https://lwn.net/Articles/412072/).
|
||||
|
||||
After this we can see the call of the `setup_log_buf` function which setups the [printk](http://www.makelinux.net/books/lkd2/ch18lev1sec3) log buffer. We already saw this function in the seventh [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html) of the linux kernel initialization process chapter.
|
||||
|
||||
PID hash initialization
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The next is `pidhash_init` function. As you know an each process has assigned unique number which called - `process identification number` or `PID`. Each process generated with fork or clone is automatically assigned a new unique `PID` value by the kernel. The management of `PIDs` centered around the two special data structures: `struct pid` and `struct upid`. First structure represents information about a `PID` in the kernel. The second structure represents the information that is visible in a specific namespace. All `PID` instances stored in the special hash table:
|
||||
|
||||
```C
|
||||
static struct hlist_head *pid_hash;
|
||||
```
|
||||
|
||||
This hash table is used to find the pid instance that belongs to a numeric `PID` value. So, `pidhash_init` initializes this hash. In the start of the `pidhash_init` function we can see the call of the `alloc_large_system_hash`:
|
||||
|
||||
```C
|
||||
pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18,
|
||||
HASH_EARLY | HASH_SMALL,
|
||||
&pidhash_shift, NULL,
|
||||
0, 4096);
|
||||
```
|
||||
|
||||
The number of elements of the `pid_hash` depends on the `RAM` configuration, but it can be between `2^4` and `2^12`. The `pidhash_init` computes the size
|
||||
and allocates the required storage (which is `hlist` in our case - the same as [doubly linked list](http://0xax.gitbooks.io/linux-insides/content/DataStructures/dlist.html), but contains one pointer instead on the [struct hlist_head](https://github.com/torvalds/linux/blob/master/include/linux/types.h)]. The `alloc_large_system_hash` function allocates a large system hash table with `memblock_virt_alloc_nopanic` if we pass `HASH_EARLY` flag (as it in our case) or with `__vmalloc` if we did no pass this flag.
|
||||
|
||||
The result we can see in the `dmesg` output:
|
||||
|
||||
```
|
||||
$ dmesg | grep hash
|
||||
[ 0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
That's all. The rest of the stuff before scheduler initialization is the following functions: `vfs_caches_init_early` does early initialization of the [virtual file system](http://en.wikipedia.org/wiki/Virtual_file_system) (more about it will be in the chapter which will describe virtual file system), `sort_main_extable` sorts the kernel's built-in exception table entries which are between `__start___ex_table` and `__stop___ex_table,`, and `trap_init` initializies trap handlers (morea about last two function we will know in the separate chapter about interrupts).
|
||||
|
||||
The last step before the scheduler initialization is initialization of the memory manager with the `mm_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). As we can see, the `mm_init` function initializes different part of the linux kernel memory manager:
|
||||
|
||||
```C
|
||||
page_ext_init_flatmem();
|
||||
mem_init();
|
||||
kmem_cache_init();
|
||||
percpu_init_late();
|
||||
pgtable_init();
|
||||
vmalloc_init();
|
||||
```
|
||||
|
||||
The first is `page_ext_init_flatmem` depends on the `CONFIG_SPARSEMEM` kernel configuration option and initializes extended data per page handling. The `mem_init` releases all `bootmem`, the `kmem_cache_init` initializes kernel cache, the `percpu_init_late` - replaces `percpu` chunks with those allocated by [slub](http://en.wikipedia.org/wiki/SLUB_%28software%29), the `pgtable_init` - initilizes the `vmalloc_init` - initializes `vmalloc`. Please, **NOTE** that we will not dive into details about all of these functions and concepts, but we will see all of they it in the [Linux kernem memory manager](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) chapter.
|
||||
|
||||
That's all. Now we can look on the `scheduler`.
|
||||
|
||||
Scheduler initialization
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
And now we came to the main purpose of this part - initialization of the task scheduler. I want to say again as I did it already many times, you will not see the full explanation of the scheduler here, there will be special chapter about this. Ok, next point is the `sched_init` function from the [kernel/sched/core.c](https://github.com/torvalds/linux/blob/master/kernel/sched/core.c) and as we can understand from the function's name, it initializes scheduler. Let's start to dive in this function and try to understand how the scheduler initialized. At the start of the `sched_init` function we can see the following code:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_FAIR_GROUP_SCHED
|
||||
alloc_size += 2 * nr_cpu_ids * sizeof(void **);
|
||||
#endif
|
||||
#ifdef CONFIG_RT_GROUP_SCHED
|
||||
alloc_size += 2 * nr_cpu_ids * sizeof(void **);
|
||||
#endif
|
||||
```
|
||||
|
||||
First of all we can see two configuration options here:
|
||||
|
||||
* `CONFIG_FAIR_GROUP_SCHED`
|
||||
* `CONFIG_RT_GROUP_SCHED`
|
||||
|
||||
Both of this options provide two different planning models. As we can read from the [documentation](https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt), the current scheduler - `CFS` or `Completely Fair Scheduler` used a simple concept. It models process scheduling as if the system had an ideal multitasking processor where each process would receive `1/n` processor time, where `n` is the number of the runnable processes. The scheduler uses the special set of rules used. These rules determine when and how to select a new process to run and they are called `scheduling policy`. The Completely Fair Scheduler supports following `normal` or `non-real-time` scheduling policies: `SCHED_NORMAL`, `SCHED_BATCH` and `SCHED_IDLE`. The `SCHED_NORMAL` is used for the most normal applications, the amount of cpu each process consumes is mostly determined by the [nice](http://en.wikipedia.org/wiki/Nice_%28Unix%29) value, the `SCHED_BATCH` used for the 100% non-interactive tasks and the `SCHED_IDLE` runs tasks only when the processor has not to run anything besides this task. The `real-time` policies are also supported for the time-critial applications: `SCHED_FIFO` and `SCHED_RR`. If you've read something about the Linux kernel scheduler, you can know that it is modular. It means that it supports different algorithms to schedule different types of processes. Usually this modularity is called `scheduler classes`. These modules encapsulate scheduling policy details and are handled by the scheduler core without the core code assuming too much about them.
|
||||
|
||||
|
||||
Now let's back to the our code and look on the two configuration options `CONFIG_FAIR_GROUP_SCHED` and `CONFIG_RT_GROUP_SCHED`. The scheduler operates on an individual task. These options allows to schedule group tasks (more about it you can read in the [CFS group scheduling](http://lwn.net/Articles/240474/)). We can see that we assign the `alloc_size` variables which represent size based on amount of the processors to allocate for the `sched_entity` and `cfs_rq` to the `2 * nr_cpu_ids * sizeof(void **)` expression with `kzalloc`:
|
||||
|
||||
```C
|
||||
ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT);
|
||||
|
||||
#ifdef CONFIG_FAIR_GROUP_SCHED
|
||||
root_task_group.se = (struct sched_entity **)ptr;
|
||||
ptr += nr_cpu_ids * sizeof(void **);
|
||||
|
||||
root_task_group.cfs_rq = (struct cfs_rq **)ptr;
|
||||
ptr += nr_cpu_ids * sizeof(void **);
|
||||
#endif
|
||||
|
||||
```
|
||||
|
||||
The `sched_entity` is struture which defined in the [include/linux/sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h) and used by the scheduler to keep track of process accounting. The `cfs_rq` presents [run queue](http://en.wikipedia.org/wiki/Run_queue). So, you can see that we allocated space with size `alloc_size` for the run queue and scheduler entity of the `root_task_group`. The `root_task_group` is an instance of the `task_group` structure from the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h) which contains task group related information:
|
||||
|
||||
```C
|
||||
struct task_group {
|
||||
...
|
||||
...
|
||||
struct sched_entity **se;
|
||||
struct cfs_rq **cfs_rq;
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
The root task group is the task group which belongs every task in system. As we allocated space for the root task group scheduler entity and runqueue, we go over all possible CPUs (`cpu_possible_mask` bitmap) and allocate zeroed memory from a particular memory node with the `kzalloc_node` function for the `load_balance_mask` `percpu` variable:
|
||||
|
||||
```C
|
||||
DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
|
||||
```
|
||||
|
||||
Here `cpumask_var_t` is the `cpumask_t` with one difference: `cpumask_var_t` is allocated only `nr_cpu_ids` bits when the `cpumask_t` always has `NR_CPUS` bits (more about `cpumask` you can read in the [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html) part). As you can see:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_CPUMASK_OFFSTACK
|
||||
for_each_possible_cpu(i) {
|
||||
per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
|
||||
cpumask_size(), GFP_KERNEL, cpu_to_node(i));
|
||||
}
|
||||
#endif
|
||||
```
|
||||
|
||||
this code depends on the `CONFIG_CPUMASK_OFFSTACK` configuration option. This configuration options says to use dynamic allocation for `cpumask`, instead of putting it on the stack. All groups have to be able to rely on the amount of CPU time. With the call of the two following functions:
|
||||
|
||||
```C
|
||||
init_rt_bandwidth(&def_rt_bandwidth,
|
||||
global_rt_period(), global_rt_runtime());
|
||||
init_dl_bandwidth(&def_dl_bandwidth,
|
||||
global_rt_period(), global_rt_runtime());
|
||||
```
|
||||
|
||||
we initialize bandwidth management for the `SCHED_DEADLINE` real-time tasks. These functions initializes `rt_bandwidth` and `dl_bandwidth` structures which are store information about maximum `deadline` bandwith of the system. For example, let's look on the implementation of the `init_rt_bandwidth` function:
|
||||
|
||||
```C
|
||||
void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
|
||||
{
|
||||
rt_b->rt_period = ns_to_ktime(period);
|
||||
rt_b->rt_runtime = runtime;
|
||||
|
||||
raw_spin_lock_init(&rt_b->rt_runtime_lock);
|
||||
|
||||
hrtimer_init(&rt_b->rt_period_timer,
|
||||
CLOCK_MONOTONIC, HRTIMER_MODE_REL);
|
||||
rt_b->rt_period_timer.function = sched_rt_period_timer;
|
||||
}
|
||||
```
|
||||
|
||||
It takes three parameters:
|
||||
|
||||
* address of the `rt_bandwidth` structure which contains information about the allocated and consumed quota within a period;
|
||||
* `period` - period over which real-time task bandwidth enforcement is measured in `us`;
|
||||
* `runtime` - part of the period that we allow tasks to run in `us`.
|
||||
|
||||
As `period` and `runtime` we pass result of the `global_rt_period` and `global_rt_runtime` functions. Which are `1s` second and and `0.95s` by default. The `rt_bandwidth` structure defined in the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h) and looks:
|
||||
|
||||
```C
|
||||
struct rt_bandwidth {
|
||||
raw_spinlock_t rt_runtime_lock;
|
||||
ktime_t rt_period;
|
||||
u64 rt_runtime;
|
||||
struct hrtimer rt_period_timer;
|
||||
};
|
||||
```
|
||||
|
||||
As you can see, it contains `runtime` and `period` and also two following fields:
|
||||
|
||||
* `rt_runtime_lock` - [spinlock](http://en.wikipedia.org/wiki/Spinlock) for the `rt_time` protection;
|
||||
* `rt_period_timer` - [high-resolution kernel timer](https://www.kernel.org/doc/Documentation/timers/hrtimers.txt) for unthrottled of real-time tasks.
|
||||
|
||||
So, in the `init_rt_bandwidth` we initialize `rt_bandwidth` period and runtime with the given parameters, initialize the spinlock and high-resolution time. In the next step, depends on the enabled [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing), we make initialization of the root domain:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_SMP
|
||||
init_defrootdomain();
|
||||
#endif
|
||||
```
|
||||
|
||||
The real-time scheduler requires global resources to make scheduling decision. But unfortenatelly scalability bottlenecks appear as the number of CPUs increase. The concept of root domains was introduced for improving scalability. The linux kernel provides special mechanism for assigning a set of CPUs and memory nodes to a set of task and it is called - `cpuset`. If a `cpuset` contains non-overlapping with other `cpuset` CPUs, it is `exclusive cpuset`. Each exclusive cpuset defines an isolated domain or `root domain` of CPUs partitioned from other cpusets or CPUs. A `root domain` presented by the `struct root_domain` from the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h) in the linux kernel and its main purpose is to narrow the scope of the global variables to per-domain variables and all real-time scheduling decisions are made only within the scope of a root domain. That's all about it, but we will see more details about it in the chapter about scheduling about real-time scheduler.
|
||||
|
||||
After `root domain` initialization, we make initialization of the bandwidth for the real-time tasks of the root task group as we did it above:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_RT_GROUP_SCHED
|
||||
init_rt_bandwidth(&root_task_group.rt_bandwidth,
|
||||
global_rt_period(), global_rt_runtime());
|
||||
#endif
|
||||
```
|
||||
|
||||
In the next step, depends on the `CONFIG_CGROUP_SCHED` kernel configuration option we initialze the `siblings` and `children` lists of the root task group. As we can read from the documentation, the `CONFIG_CGROUP_SCHED` is:
|
||||
|
||||
```
|
||||
This option allows you to create arbitrary task groups using the "cgroup" pseudo
|
||||
filesystem and control the cpu bandwidth allocated to each such task group.
|
||||
```
|
||||
|
||||
As we finished with the lists initialization, we can see the call of the `autogroup_init` function:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_CGROUP_SCHED
|
||||
list_add(&root_task_group.list, &task_groups);
|
||||
INIT_LIST_HEAD(&root_task_group.children);
|
||||
INIT_LIST_HEAD(&root_task_group.siblings);
|
||||
autogroup_init(&init_task);
|
||||
#endif
|
||||
```
|
||||
|
||||
which initializes automatic process group scheduling.
|
||||
|
||||
After this we are going through the all `possible` cpu (you can remember that `possible` CPUs store in the `cpu_possible_mask` bitmap of possible CPUs that can ever be available in the system) and initialize a `runqueue` for each possible cpu:
|
||||
|
||||
```C
|
||||
for_each_possible_cpu(i) {
|
||||
struct rq *rq;
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
Each processor has its own locking and individual runqueue. All runnalble tasks are stored in an active array and indexed according to its priority. When a process consumes its time slice, it is moved to an expired array. All of these arras are stored in the special structure which names is `runqueu`. As there are no global lock and runqueu, we are going through the all possible CPUs and initialize runqueue for the every cpu. The `runque` is presented by the `rq` structure in the linux kernel which defined in the [kernel/sched/sched.h](https://github.com/torvalds/linux/blob/master/kernel/sched/sched.h).
|
||||
|
||||
```C
|
||||
rq = cpu_rq(i);
|
||||
raw_spin_lock_init(&rq->lock);
|
||||
rq->nr_running = 0;
|
||||
rq->calc_load_active = 0;
|
||||
rq->calc_load_update = jiffies + LOAD_FREQ;
|
||||
init_cfs_rq(&rq->cfs);
|
||||
init_rt_rq(&rq->rt);
|
||||
init_dl_rq(&rq->dl);
|
||||
rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
|
||||
```
|
||||
|
||||
Here we get the runque for the every CPU with the `cpu_rq` macto which returns `runqueues` percpu variable and start to initialize it with runqueu lock, number of running tasks, `calc_load` relative fields (`calc_load_active` and `calc_load_update`) which are used in the reckoning of a CPU load and initialization of the completely fair, real-time and deadline related fields in a runqueue. After this we initialize `cpu_load` array with zeros and set the last load update tick to the `jiffies` variable which determines the number of time ticks (cycles), since the system boot:
|
||||
|
||||
```C
|
||||
for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
|
||||
rq->cpu_load[j] = 0;
|
||||
|
||||
rq->last_load_update_tick = jiffies;
|
||||
```
|
||||
|
||||
where `cpu_load` keeps history of runqueue loads in the past, for now `CPU_LOAD_IDX_MAX` is 5. In the next step we fill `runqueue` fields which are related to the [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing), but we will not cover they in this part. And in the end of the loop we initialize high-resolution timer for the give `runqueue` and set the `iowait` (more about it in the separate part about scheduler) number:
|
||||
|
||||
```C
|
||||
init_rq_hrtick(rq);
|
||||
atomic_set(&rq->nr_iowait, 0);
|
||||
```
|
||||
|
||||
Now we came out from the `for_each_possible_cpu` loop and the next we need to set load weight for the `init` task with the `set_load_weight` function. Weight of process is calculated through its dynamic priority which is static priority + scheduling class of the process. After this we increase memory usage counter of the memory descriptor of the `init` process and set scheduler class for the current process:
|
||||
|
||||
```C
|
||||
atomic_inc(&init_mm.mm_count);
|
||||
current->sched_class = &fair_sched_class;
|
||||
```
|
||||
|
||||
And make current process (it will be the first `init` process) `idle` and update the value of the `calc_load_update` with the 5 seconds interval:
|
||||
|
||||
```C
|
||||
init_idle(current, smp_processor_id());
|
||||
calc_load_update = jiffies + LOAD_FREQ;
|
||||
```
|
||||
|
||||
So, the `init` process will be runned, when there will no other candidates (as it first process in the system). In the end we just set `scheduler_running` variable:
|
||||
|
||||
```C
|
||||
scheduler_running = 1;
|
||||
```
|
||||
|
||||
That's all. Linux kernel scheduler is initialized. Of course, we missed many different details and explanations here, because we need to know and understand how different concepts (like process and process groups, runqueue, rcu and etc...) works in the linux kernel , but we took a short look on the scheduler initialization process. All other details we will look in the separate part which will be fully dedicated to the scheduler.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
It is the end of the eighth part about the linux kernel initialization process. In this part, we looked on the initialization process of the scheduler and we will continue in the next part to dive in the linux kernel initialization process and will see initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update) and many more.
|
||||
|
||||
and other initialization stuff in the next part.
|
||||
|
||||
If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
||||
|
||||
**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [CPU masks](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
|
||||
* [high-resolution kernel timer](https://www.kernel.org/doc/Documentation/timers/hrtimers.txt)
|
||||
* [spinlock](http://en.wikipedia.org/wiki/Spinlock)
|
||||
* [Run queue](http://en.wikipedia.org/wiki/Run_queue)
|
||||
* [Linux kernem memory manager](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)
|
||||
* [slub](http://en.wikipedia.org/wiki/SLUB_%28software%29)
|
||||
* [virtual file system](http://en.wikipedia.org/wiki/Virtual_file_system)
|
||||
* [Linux kernel hotplug documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
|
||||
* [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
|
||||
* [Global Descriptor Table](http://en.wikipedia.org/wiki/Global_Descriptor_Table)
|
||||
* [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
|
||||
* [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing)
|
||||
* [RCU](http://en.wikipedia.org/wiki/Read-copy-update)
|
||||
* [CFS Scheduler documentation](https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt)
|
||||
* [Real-Time group scheduling](https://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html)
|
@ -0,0 +1,430 @@
|
||||
Kernel initialization. Part 9.
|
||||
================================================================================
|
||||
|
||||
RCU initialization
|
||||
================================================================================
|
||||
|
||||
This is ninth part of the [Linux Kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and in the previous part we stopped at the [scheduler initialization](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html). In this part we will continue to dive to the linux kernel initialization process and the main purpose of this part will be to learn about initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). We can see that the next step in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) after the `sched_init` is the call of the `preempt_disablepreempt_disable`. There are two macros:
|
||||
|
||||
* `preempt_disable`
|
||||
* `preempt_enable`
|
||||
|
||||
for preemption disabling and enabling. First of all let's try to understand what is it `preempt` in the context of an operating system kernel. In a simple words, preemption is ability of the operating system kernel to preempt current task to run task with higher priority. Here we need to disable preemption because we will have only one `init` process for the early boot time and we no need to stop it before we will call `cpu_idle` function. The `preempt_disable` macro defined in the [include/linux/preempt.h](https://github.com/torvalds/linux/blob/master/include/linux/preempt.h) and depends on the `CONFIG_PREEMPT_COUNT` kernel configuration option. This maco implemeted as:
|
||||
|
||||
```C
|
||||
#define preempt_disable() \
|
||||
do { \
|
||||
preempt_count_inc(); \
|
||||
barrier(); \
|
||||
} while (0)
|
||||
```
|
||||
|
||||
and if `CONFIG_PREEMPT_COUNT` is not set just:
|
||||
|
||||
```C
|
||||
#define preempt_disable() barrier()
|
||||
```
|
||||
|
||||
Let's look on it. First of all we can see one difference between these macro implementations. The `preempt_disable` with `CONFIG_PREEMPT_COUNT` contains the call of the `preempt_count_inc`. There is special `percpu` variable which stores the number of held locks and `preempt_disable` calls:
|
||||
|
||||
```C
|
||||
DECLARE_PER_CPU(int, __preempt_count);
|
||||
```
|
||||
|
||||
In the first implementation of the `preempt_disable` we increment this `__preempt_count`. There is API for returning value of the `__preempt_count`, it is the `preempt_count` function. As we called `preempt_disable`, first of all we increment preemption counter with the `preempt_count_inc` macro which expands to the:
|
||||
|
||||
```
|
||||
#define preempt_count_inc() preempt_count_add(1)
|
||||
#define preempt_count_add(val) __preempt_count_add(val)
|
||||
```
|
||||
|
||||
where `preempt_count_add` calls the `raw_cpu_add_4` macro which adds `1` to the given `percpu` variable (`__preempt_count`) in our case (more about `precpu` variables you can read in the part about [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)). Ok, we increased `__preempt_count` and th next step we can see the call of the `barrier` macro in the both macros. The `barrier` macro inserts an optimization barrier. In the processors with `x86_64` architecture independent memory access operations can be performed in any order. That's why we need in the oportunity to point compiler and processor on compliance of order. This mechanism is memory barrier. Let's consider simple example:
|
||||
|
||||
```C
|
||||
preempt_disable();
|
||||
foo();
|
||||
preempt_enable();
|
||||
```
|
||||
|
||||
Compiler can rearrange it as:
|
||||
|
||||
```C
|
||||
preempt_disable();
|
||||
preempt_enable();
|
||||
foo();
|
||||
```
|
||||
|
||||
In this case non-preemptible function `foo` can be preempted. As we put `barrier` macro in the `preempt_disable` and `preempt_enable` macros, it prevents the compiler from swapping `preempt_count_inc` with other statements. More about barriers you can read [here](http://en.wikipedia.org/wiki/Memory_barrier) and [here](https://www.kernel.org/doc/Documentation/memory-barriers.txt).
|
||||
|
||||
In the next step we can see following statement:
|
||||
|
||||
```C
|
||||
if (WARN(!irqs_disabled(),
|
||||
"Interrupts were enabled *very* early, fixing it\n"))
|
||||
local_irq_disable();
|
||||
```
|
||||
|
||||
which check [IRQs](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) state, and disabling (with `cli` instruction for `x86_64`) if they are enabled.
|
||||
|
||||
That's all. Preemption is disabled and we can go ahead.
|
||||
|
||||
Initialization of the integer ID management
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
In the next step we can see the call of the `idr_init_cache` function which defined in the [lib/idr.c](https://github.com/torvalds/linux/blob/master/lib/idr.c). The `idr` library used in a various [places](http://lxr.free-electrons.com/ident?i=idr_find) in the linux kernel to manage assigning integer `IDs` to objects and looking up objects by id.
|
||||
|
||||
Let's look on the implementation of the `idr_init_cache` function:
|
||||
|
||||
```C
|
||||
void __init idr_init_cache(void)
|
||||
{
|
||||
idr_layer_cache = kmem_cache_create("idr_layer_cache",
|
||||
sizeof(struct idr_layer), 0, SLAB_PANIC, NULL);
|
||||
}
|
||||
```
|
||||
|
||||
Here we can see the call of the `kmem_cache_create`. We already called the `kmem_cache_init` in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L485). This function create generalized caches again using the `kmem_cache_alloc` (more about caches we will see in the [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) chapter). In our case, as we are using `kmem_cache_t` it will be used the [slab](http://en.wikipedia.org/wiki/Slab_allocation) allocator and `kmem_cache_create` creates it. As you can seee we pass five parameters to the `kmem_cache_create`:
|
||||
|
||||
* name of the cache;
|
||||
* size of the object to store in cache;
|
||||
* offset of the first object in the page;
|
||||
* flags;
|
||||
* constructor for the objects.
|
||||
|
||||
and it will create `kmem_cache` for the integer IDs. Integer `IDs` is commonly used pattern for the to map set of integer IDs to the set of pointers. We can see usage of the integer IDs for example in the [i2c](http://en.wikipedia.org/wiki/I%C2%B2C) drivers subsystem. For example [drivers/i2c/i2c-core.c]((https://github.com/torvalds/linux/blob/master/drivers/i2c/i2c-core) which presentes the core of the `i2c` subsystem defines `ID` for the `i2c` adapter with the `DEFINE_IDR` macro:
|
||||
|
||||
```C
|
||||
static DEFINE_IDR(i2c_adapter_idr);
|
||||
```
|
||||
|
||||
and than it uses it for the declaration of the `i2c` adapter:
|
||||
|
||||
```C
|
||||
static int __i2c_add_numbered_adapter(struct i2c_adapter *adap)
|
||||
{
|
||||
int id;
|
||||
...
|
||||
...
|
||||
...
|
||||
id = idr_alloc(&i2c_adapter_idr, adap, adap->nr, adap->nr + 1, GFP_KERNEL);
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
and `id2_adapter_idr` presents dynamically calculated bus number.
|
||||
|
||||
More about integer ID management you can read [here](https://lwn.net/Articles/103209/).
|
||||
|
||||
RCU initialization
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The next step is [RCU](http://en.wikipedia.org/wiki/Read-copy-update) initialization with the `rcu_init` function and it's implementation depends on two kernel configuration options:
|
||||
|
||||
* `CONFIG_TINY_RCU`
|
||||
* `CONFIG_TREE_RCU`
|
||||
|
||||
In the first case `rcu_init` will be in the [kernel/rcu/tiny.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tiny.c) and in the second case it will be defined in the [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c). We will see the implementation of the `tree rcu`, but first of all about the `RCU` in general.
|
||||
|
||||
`RCU` or read-copy update is a scalable high-performance synchronization mechanism implemented in the Linux kernel. On the early stage the linux kernel provided support and environment for the concurently running applications, but all execution was serialized in the kernel using a single global lock. In our days linux kernel has no single global lock, but provides different mechanisms including [lock-free data structures](http://en.wikipedia.org/wiki/Concurrent_data_structure), [percpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) data structures and other. One of these mechanisms is - the `read-copy update`. The `RCU` technique designed for rarely-modified data structures. The idea of the `RCU` is simple. For example we have a rarely-modified data structure. If somebody wants to change this data structure, we make a copy of this data structure and make all changes in the copy. In the same time all other users of the data structure use old version of it. Next, we need to choose safe moment when original version of the data structure will have no users and update it with the modified copy.
|
||||
|
||||
Of course this description of the `RCU` is very simplified. To understand some details about `RCU`, first of all we need to learn some terminology. Data readers in the `RCU` executed in the [critical section](http://en.wikipedia.org/wiki/Critical_section). Everytime when data reader joins to the critical section, it calls the `rcu_read_lock`, and `rcu_read_unlock` on exit from the critical section. If the thread is not in the critical section, it will be in state which called - `quiescent state`. Every moment when every thread was in the `quiescent state` called - `grace period`. If a thread wants to remove element from the data structure, this occurs in two steps. First steps is `removal` - atomically removes element from the data structure, but does not release the physical memory. After this thread-writer announces and waits while it will be finsihed. From this moment, the removed element is available to the thread-readers. After the `grace perioud` will be finished, the second step of the element removal will be started, it just removes element from the physical memory.
|
||||
|
||||
There a couple implementations of the `RCU`. Old `RCU` called classic, the new implemetation called `tree` RCU. As you already can undrestand, the `CONFIG_TREE_RCU` kernel configuration option enables tree `RCU`. Another is the `tiny` RCU which depends on `CONFIG_TINY_RCU` and `CONFIG_SMP=n`. We will see more details about the `RCU` in general in the separate chapter about synchronization primitives, but now let's look on the `rcu_init` implementation from the [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c):
|
||||
|
||||
```C
|
||||
void __init rcu_init(void)
|
||||
{
|
||||
int cpu;
|
||||
|
||||
rcu_bootup_announce();
|
||||
rcu_init_geometry();
|
||||
rcu_init_one(&rcu_bh_state, &rcu_bh_data);
|
||||
rcu_init_one(&rcu_sched_state, &rcu_sched_data);
|
||||
__rcu_init_preempt();
|
||||
open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
|
||||
|
||||
/*
|
||||
* We don't need protection against CPU-hotplug here because
|
||||
* this is called early in boot, before either interrupts
|
||||
* or the scheduler are operational.
|
||||
*/
|
||||
cpu_notifier(rcu_cpu_notify, 0);
|
||||
pm_notifier(rcu_pm_notify, 0);
|
||||
for_each_online_cpu(cpu)
|
||||
rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
|
||||
|
||||
rcu_early_boot_tests();
|
||||
}
|
||||
```
|
||||
|
||||
In the beginning of the `rcu_init` function we define `cpu` variable and call `rcu_bootup_announce`. The `rcu_bootup_announce` function is pretty simple:
|
||||
|
||||
```C
|
||||
static void __init rcu_bootup_announce(void)
|
||||
{
|
||||
pr_info("Hierarchical RCU implementation.\n");
|
||||
rcu_bootup_announce_oddness();
|
||||
}
|
||||
```
|
||||
|
||||
It just prints information about the `RCU` with the `pr_info` function and `rcu_bootup_announce_oddness` which uses `pr_info` too, for printing different information about the current `RCU` configuration which depends on different kernel configuration options like `CONFIG_RCU_TRACE`, `CONFIG_PROVE_RCU`, `CONFIG_RCU_FANOUT_EXACT` and etc... In the next step, we can see the call of the `rcu_init_geometry` function. This function defined in the same source code file and computes the node tree geometry depends on amount of CPUs. Actually `RCU` provides scalability with extremely low internal to RCU lock contention. What if a data structure will be read from the different CPUs? `RCU` API provides the `rcu_state` structure wihch presents RCU global state including node hierarchy. Hierachy presented by the:
|
||||
|
||||
```
|
||||
struct rcu_node node[NUM_RCU_NODES];
|
||||
```
|
||||
|
||||
array of structures. As we can read in the comment which is above definition of this structure:
|
||||
|
||||
```
|
||||
The root (first level) of the hierarchy is in ->node[0] (referenced by ->level[0]), the second
|
||||
level in ->node[1] through ->node[m] (->node[1] referenced by ->level[1]), and the third level
|
||||
in ->node[m+1] and following (->node[m+1] referenced by ->level[2]). The number of levels is
|
||||
determined by the number of CPUs and by CONFIG_RCU_FANOUT.
|
||||
|
||||
Small systems will have a "hierarchy" consisting of a single rcu_node.
|
||||
```
|
||||
|
||||
The `rcu_node` structure defined in the [kernel/rcu/tree.h](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.h) and contains information about current grace period, is grace period completed or not, CPUs or groups that need to switch in order for current grace period to proceed and etc... Every `rcu_node` contains a lock for a couple of CPUs. These `rcu_node` structures embedded into a linear array in the `rcu_state` structure and represeted as a tree with the root in the zero element and it covers all CPUs. As you can see the number of the rcu nodes determined by the `NUM_RCU_NODES` which depends on number of available CPUs:
|
||||
|
||||
```C
|
||||
#define NUM_RCU_NODES (RCU_SUM - NR_CPUS)
|
||||
#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3 + NUM_RCU_LVL_4)
|
||||
```
|
||||
|
||||
where levels values depend on the `CONFIG_RCU_FANOUT_LEAF` configuration option. For example for the simplest case, one `rcu_node` will cover two CPU on machine with the eight CPUs:
|
||||
|
||||
```
|
||||
+-----------------------------------------------------------------+
|
||||
| rcu_state |
|
||||
| +----------------------+ |
|
||||
| | root | |
|
||||
| | rcu_node | |
|
||||
| +----------------------+ |
|
||||
| | | |
|
||||
| +----v-----+ +--v-------+ |
|
||||
| | | | | |
|
||||
| | rcu_node | | rcu_node | |
|
||||
| | | | | |
|
||||
| +------------------+ +----------------+ |
|
||||
| | | | | |
|
||||
| | | | | |
|
||||
| +----v-----+ +-------v--+ +-v--------+ +-v--------+ |
|
||||
| | | | | | | | | |
|
||||
| | rcu_node | | rcu_node | | rcu_node | | rcu_node | |
|
||||
| | | | | | | | | |
|
||||
| +----------+ +----------+ +----------+ +----------+ |
|
||||
| | | | | |
|
||||
| | | | | |
|
||||
| | | | | |
|
||||
| | | | | |
|
||||
+---------|-----------------|-------------|---------------|-------+
|
||||
| | | |
|
||||
+---------v-----------------v-------------v---------------v--------+
|
||||
| | | | |
|
||||
| CPU1 | CPU3 | CPU5 | CPU7 |
|
||||
| | | | |
|
||||
| CPU2 | CPU4 | CPU6 | CPU8 |
|
||||
| | | | |
|
||||
+------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
So, in the `rcu_init_geometry` function we just need to calculate the total number of `rcu_node` structures. We start to do it with the calculation of the `jiffies` till to the first and next `fqs` which is `force-quiescent-state` (read above about it):
|
||||
|
||||
```C
|
||||
d = RCU_JIFFIES_TILL_FORCE_QS + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
|
||||
if (jiffies_till_first_fqs == ULONG_MAX)
|
||||
jiffies_till_first_fqs = d;
|
||||
if (jiffies_till_next_fqs == ULONG_MAX)
|
||||
jiffies_till_next_fqs = d;
|
||||
```
|
||||
|
||||
where:
|
||||
|
||||
```C
|
||||
#define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
|
||||
#define RCU_JIFFIES_FQS_DIV 256
|
||||
```
|
||||
|
||||
As we calculated these [jiffies](http://en.wikipedia.org/wiki/Jiffy_%28time%29), we check that previous defined `jiffies_till_first_fqs` and `jiffies_till_next_fqs` variables are equal to the [ULONG_MAX](http://www.rowleydownload.co.uk/avr/documentation/index.htm?http://www.rowleydownload.co.uk/avr/documentation/ULONG_MAX.htm) (their default values) and set they equal to the calculated value. As we did not touch these variables before, they are equal to the `ULONG_MAX`:
|
||||
|
||||
```C
|
||||
static ulong jiffies_till_first_fqs = ULONG_MAX;
|
||||
static ulong jiffies_till_next_fqs = ULONG_MAX;
|
||||
```
|
||||
|
||||
In the next step of the `rcu_init_geometry`, we check that `rcu_fanout_leaf` didn't chage (it has the same value as `CONFIG_RCU_FANOUT_LEAF` in compile-time) and equal to the value of the `CONFIG_RCU_FANOUT_LEAF` configuration option, we just return:
|
||||
|
||||
```C
|
||||
if (rcu_fanout_leaf == CONFIG_RCU_FANOUT_LEAF &&
|
||||
nr_cpu_ids == NR_CPUS)
|
||||
return;
|
||||
```
|
||||
|
||||
After this we need to compute the number of nodes that can be handled an `rcu_node` tree with the given number of levels:
|
||||
|
||||
```C
|
||||
rcu_capacity[0] = 1;
|
||||
rcu_capacity[1] = rcu_fanout_leaf;
|
||||
for (i = 2; i <= MAX_RCU_LVLS; i++)
|
||||
rcu_capacity[i] = rcu_capacity[i - 1] * CONFIG_RCU_FANOUT;
|
||||
```
|
||||
|
||||
And in the last step we calcluate the number of rcu_nodes at each level of the tree in the [loop](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c#L4094).
|
||||
|
||||
As we calculated geometry of the `rcu_node` tree, we need to back to the `rcu_init` function and next step we need to initialize two `rcu_state` structures with the `rcu_init_one` function:
|
||||
|
||||
```C
|
||||
rcu_init_one(&rcu_bh_state, &rcu_bh_data);
|
||||
rcu_init_one(&rcu_sched_state, &rcu_sched_data);
|
||||
```
|
||||
|
||||
The `rcu_init_one` function takes two arguments:
|
||||
|
||||
* Global `RCU` state;
|
||||
* Per-CPU data for `RCU`.
|
||||
|
||||
Both variables defined in the [kernel/rcu/tree.h](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.h) with its `percpu` data:
|
||||
|
||||
```
|
||||
extern struct rcu_state rcu_bh_state;
|
||||
DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
|
||||
```
|
||||
|
||||
About this states you can read [here](http://lwn.net/Articles/264090/). As I wrote above we need to initialize `rcu_state` structures and `rcu_init_one` function will help us with it. After the `rcu_state` initialization, we can see the call of the ` __rcu_init_preempt` which depends on the `CONFIG_PREEMPT_RCU` kernel configuration option. It does the same that previous functions - initialization of the `rcu_preempt_state` structure with the `rcu_init_one` function which has `rcu_state` type. After this, in the `rcu_init`, we can see the call of the:
|
||||
|
||||
```C
|
||||
open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
|
||||
```
|
||||
|
||||
function. This function registers a handler of the `pending interrupt`. Pending interrupt or `softirq` supposes that part of actions cab be delayed for later execution when the system will be less loaded. Pending interrupts represeted by the following structure:
|
||||
|
||||
```C
|
||||
struct softirq_action
|
||||
{
|
||||
void (*action)(struct softirq_action *);
|
||||
};
|
||||
```
|
||||
|
||||
which defined in the [include/linux/interrupt.h](https://github.com/torvalds/linux/blob/master/include/linux/interrupt.h) and contains only one field - handler of an interrupt. You can know about `softirqs` in the your system with the:
|
||||
|
||||
```
|
||||
$ cat /proc/softirqs
|
||||
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
|
||||
HI: 2 0 0 1 0 2 0 0
|
||||
TIMER: 137779 108110 139573 107647 107408 114972 99653 98665
|
||||
NET_TX: 1127 0 4 0 1 1 0 0
|
||||
NET_RX: 334 221 132939 3076 451 361 292 303
|
||||
BLOCK: 5253 5596 8 779 2016 37442 28 2855
|
||||
BLOCK_IOPOLL: 0 0 0 0 0 0 0 0
|
||||
TASKLET: 66 0 2916 113 0 24 26708 0
|
||||
SCHED: 102350 75950 91705 75356 75323 82627 69279 69914
|
||||
HRTIMER: 510 302 368 260 219 255 248 246
|
||||
RCU: 81290 68062 82979 69015 68390 69385 63304 63473
|
||||
```
|
||||
|
||||
The `open_softirq` function takes two parameters:
|
||||
|
||||
* index of the interrupt;
|
||||
* interrupt handler.
|
||||
|
||||
and adds interrupt handler to the array of the pending interrupts:
|
||||
|
||||
```C
|
||||
void open_softirq(int nr, void (*action)(struct softirq_action *))
|
||||
{
|
||||
softirq_vec[nr].action = action;
|
||||
}
|
||||
```
|
||||
|
||||
In our case the interrupt handler is - `rcu_process_callbacks` which defined in the [kernel/rcu/tree.c](https://github.com/torvalds/linux/blob/master/kernel/rcu/tree.c) and does the `RCU` core processing for the current CPU. After we registered `softirq` interrupt for the `RCU`, we can see the following code:
|
||||
|
||||
```C
|
||||
cpu_notifier(rcu_cpu_notify, 0);
|
||||
pm_notifier(rcu_pm_notify, 0);
|
||||
for_each_online_cpu(cpu)
|
||||
rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
|
||||
```
|
||||
|
||||
Here we can see registration of the `cpu` notifier which needs in sysmtems which supports [CPU hotplug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt) and we will not dive into details about this theme. The last function in the `rcu_init` is the `rcu_early_boot_tests`:
|
||||
|
||||
```C
|
||||
void rcu_early_boot_tests(void)
|
||||
{
|
||||
pr_info("Running RCU self tests\n");
|
||||
|
||||
if (rcu_self_test)
|
||||
early_boot_test_call_rcu();
|
||||
if (rcu_self_test_bh)
|
||||
early_boot_test_call_rcu_bh();
|
||||
if (rcu_self_test_sched)
|
||||
early_boot_test_call_rcu_sched();
|
||||
}
|
||||
```
|
||||
|
||||
which runs self tests for the `RCU`.
|
||||
|
||||
That's all. We saw initialization process of the `RCU` subsystem. As I wrote above, more about the `RCU` will be in the separate chapter about synchronization primitives.
|
||||
|
||||
Rest of the initialization process
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Ok, we already passed the main theme of this part which is `RCU` initialization, but it is not the end of the linux kernel initialization process. In the last paragraph of this theme we will see a couple of functions which work in the initialization time, but we will not dive into deep details around this function by different reasons. Some reasons not to dive into details are following:
|
||||
|
||||
* They are not very important for the generic kernel initialization process and can depend on the different kernel configuration;
|
||||
* They have the character of debugging and not important too for now;
|
||||
* We will see many of this stuff in the separate parts/chapters.
|
||||
|
||||
After we initilized `RCU`, the next step which you can see in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) is the - `trace_init` function. As you can understand from its name, this function initialize [tracing](http://en.wikipedia.org/wiki/Tracing_%28software%29) subsystem. More about linux kernel trace system you can read - [here](http://elinux.org/Kernel_Trace_Systems).
|
||||
|
||||
After the `trace_init`, we can see the call of the `radix_tree_init`. If you are familar with the different data structures, you can understand from the name of this function that it initializes kernel implementation of the [Radix tree](http://en.wikipedia.org/wiki/Radix_tree). This function defined in the [lib/radix-tree.c](https://github.com/torvalds/linux/blob/master/lib/radix-tree.c) and more about it you can read in the part about [Radix tree](http://0xax.gitbooks.io/linux-insides/content/DataStructures/radix-tree.md).
|
||||
|
||||
In the next step we can see the functions which are related to the `interrupts handling` subsystem, they are:
|
||||
|
||||
* `early_irq_init`
|
||||
* `init_IRQ`
|
||||
* `softirq_init`
|
||||
|
||||
We will see explanation about this functions and their implementation in the special part about interrupts and exceptions handling. After this many different functions (like `init_timers`, `hrtimers_init`, `time_init` and etc...) which are related to different timing and timers stuff. More about these function we will see in the chapter about timers.
|
||||
|
||||
The next couple of functions related with the [perf](https://perf.wiki.kernel.org/index.php/Main_Page) events - `perf_event-init` (will be separate chapter about perf), initialization of the `profiling` with the `profile_init`. After this we enable `irq` with the call of the:
|
||||
|
||||
```C
|
||||
local_irq_enable();
|
||||
```
|
||||
|
||||
which expands to the `sti` instruction and making post initialization of the [SLAB](http://en.wikipedia.org/wiki/Slab_allocation) with the call of the `kmem_cache_init_late` function (As I wrote above we will know about the `SLAB` in the [Linux memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) chapter).
|
||||
|
||||
After the post initialization of the `SLAB`, next point is initialization of the console with the `console_init` function from the [drivers/tty/tty_io.c](https://github.com/torvalds/linux/blob/master/drivers/tty/tty_io.c).
|
||||
|
||||
After the console initialization, we can see the `lockdep_info` function which prints information about the [Lock dependency validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt). After this, we can see the initialization of the dynamic allocation of the `debug objects` with the `debug_objects_mem_init`, kernel memory leack [detector](https://www.kernel.org/doc/Documentation/kmemleak.txt) initialization with the `kmemleak_init`, `percpu` pageset setup with the `setup_per_cpu_pageset`, setup of the [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) policy with the `numa_policy_init`, setting time for the scheduler with the `sched_clock_init`, `pidmap` initialization with the call of the `pidmap_init` function for the initial `PID` namespace, cache creation with the `anon_vma_init` for the private virtual memory areas and early initialization of the [ACPI](http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface) with the `acpi_early_init`.
|
||||
|
||||
This is the end of the ninth part of the [linux kernel initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) and here we saw initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update). In the last paragraph of this part (`Rest of the initialization process`) we went thorugh the many functions but did not dive into details about their implementations. Do not worry if you do not know anything about these stuff or you know and do not understand anything about this. As I wrote already many times, we will see details of implementations, but in the other parts or other chapters.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
It is the end of the ninth part about the linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html). In this part, we looked on the initialization process of the `RCU` subsystem. In the next part we will continue to dive into linux kernel initialization process and I hope that we will finish with the `start_kernel` function and will go to the `rest_init` function from the same [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file and will see that start of the first process.
|
||||
|
||||
If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
||||
|
||||
**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [lock-free data structures](http://en.wikipedia.org/wiki/Concurrent_data_structure)
|
||||
* [kmemleak](https://www.kernel.org/doc/Documentation/kmemleak.txt)
|
||||
* [ACPI](http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface)
|
||||
* [IRQs](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
|
||||
* [RCU](http://en.wikipedia.org/wiki/Read-copy-update)
|
||||
* [RCU documentation](https://github.com/torvalds/linux/tree/master/Documentation/RCU)
|
||||
* [integer ID management](https://lwn.net/Articles/103209/)
|
||||
* [Documentation/memory-barriers.txt](https://www.kernel.org/doc/Documentation/memory-barriers.txt)
|
||||
* [Runtime locking correctness validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
|
||||
* [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
|
||||
* [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)
|
||||
* [slab](http://en.wikipedia.org/wiki/Slab_allocation)
|
||||
* [i2c](http://en.wikipedia.org/wiki/I%C2%B2C)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-8.html)
|
@ -0,0 +1,9 @@
|
||||
# Interrupts and Interrupt Handling
|
||||
|
||||
You will find a couple of posts which describes an interrupts and an exceptions handling in the linux kernel.
|
||||
|
||||
* [Interrupts and Interrupt Handling. Part 1.](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-1.md) - describes an interrupts handling theory.
|
||||
* [Start to dive into interrupts in the Linux kernel](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-2.md) - this part starts to describe interrupts and exceptions handling related stuff from the early stage.
|
||||
* [Early interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-3.md) - third part describes early interrupt handlers.
|
||||
* [Interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-4.md) - fourth part describes first non-early interrupt handlers.
|
||||
* [Implementation of exception handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-5.md) - descripbes implementation of some exception handlers as double fault, divide by zero and etc.
|
@ -0,0 +1,547 @@
|
||||
Interrupts and Interrupt Handling. Part 2.
|
||||
================================================================================
|
||||
|
||||
Start to dive into interrupt and exceptions handling in the Linux kernel
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
We saw some theory about an interrupts and an exceptions handling in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) and as I already wrote in that part, we will start to dive into interrupts and exceptions in the Linux kernel source code in this part. As you already can note, the previous part mostly described theoretical aspects and since this part we will start to dive directly into the Linux kernel source code. We will start to do it as we did it in other chapters, from the very early places. We will not see the Linux kernel source code from the earliest [code lines](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L292) as we saw it for example in the [Linux kernel booting process](http://0xax.gitbooks.io/linux-insides/content/Booting/index.html) chapter, but we will start from the earliest code which is related to the interrupts and exceptions. Since this part we will try to go through the all interrupts and exceptions related stuff which we can find in the Linux kernel source code.
|
||||
|
||||
If you've read the previous parts, you can remember that the earliest place in the Linux kernel `x86_64` architecture-specifix source code which is related to the interrupt is located in the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c) source code file and represents the first setup of the [Interrupt Descriptor Table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table). It occurs right before the transition into the [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in the `go_to_protected_mode` function by the call of the `setup_idt`:
|
||||
|
||||
```C
|
||||
void go_to_protected_mode(void)
|
||||
{
|
||||
...
|
||||
setup_idt();
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
The `setup_idt` function defined in the same source code file as the `go_to_protected_mode` function and just loads address of the `NULL` interrupts descriptor table:
|
||||
|
||||
```C
|
||||
static void setup_idt(void)
|
||||
{
|
||||
static const struct gdt_ptr null_idt = {0, 0};
|
||||
asm volatile("lidtl %0" : : "m" (null_idt));
|
||||
}
|
||||
```
|
||||
|
||||
where `gdt_ptr` represents special 48-bit `GTDR` register which must contain base address of the `Global Descriptor Table`:
|
||||
|
||||
```C
|
||||
struct gdt_ptr {
|
||||
u16 len;
|
||||
u32 ptr;
|
||||
} __attribute__((packed));
|
||||
```
|
||||
|
||||
Of course in our case the `gdt_ptr` does not represent `GDTR` register, but `IDTR` since we set `Interrupt Descriptor Table`. You will not find `idt_ptr` structure, because if it had been in the Linux kernel source code, it would have been the same as `gdt_ptr` but with different name. So, as you can understand there is no sense to have two similar structures which are differ only in a name. You can note here, that we do not fill the `Interrupt Descriptor Table` with entries, because it is too early to handle any interrupts or exceptions for this moment. That's why we just fill the `IDT` with the `NULL`.
|
||||
|
||||
And after the setup of the [Interrupt descriptor table](http://en.wikipedia.org/wiki/Interrupt_descriptor_table), [Global Descriptor Table](http://en.wikipedia.org/wiki/GDT) and other stuff we jump into [protected mode](http://en.wikipedia.org/wiki/Protected_mode) in the - [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S). More about it you can read in the [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-3.html) which describes transition to the protected mode.
|
||||
|
||||
We already know from the earliest parts that entry of the protected mode located in the `boot_params.hdr.code32_start` and you can see that we pass the entry of the protected mode and `boot_params` to the `protected_mode_jump` in the end of the [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c):
|
||||
|
||||
```C
|
||||
protected_mode_jump(boot_params.hdr.code32_start,
|
||||
(u32)&boot_params + (ds() << 4));
|
||||
```
|
||||
|
||||
The `protected_mode_jump` defined in the [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S) and gets these two parameters in the `ax` and `dx` registers using one of the [8086](http://en.wikipedia.org/wiki/Intel_8086) calling [convention](http://en.wikipedia.org/wiki/X86_calling_conventions#List_of_x86_calling_conventions):
|
||||
|
||||
```assembly
|
||||
GLOBAL(protected_mode_jump)
|
||||
...
|
||||
...
|
||||
...
|
||||
.byte 0x66, 0xea # ljmpl opcode
|
||||
2: .long in_pm32 # offset
|
||||
.word __BOOT_CS # segment
|
||||
...
|
||||
...
|
||||
...
|
||||
ENDPROC(protected_mode_jump)
|
||||
```
|
||||
|
||||
where `in_pm32` contains jump to the 32-bit entrypoint:
|
||||
|
||||
```assembly
|
||||
GLOBAL(in_pm32)
|
||||
...
|
||||
...
|
||||
jmpl *%eax // %eax contains address of the `startup_32`
|
||||
...
|
||||
...
|
||||
ENDPROC(in_pm32)
|
||||
```
|
||||
|
||||
As you can remember 32-bit entry point is in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly file, although it contains `_64` in the its name. We can see the two similar files in the `arch/x86/boot/compressed` directory:
|
||||
|
||||
* `arch/x86/boot/compressed/head_32.S`.
|
||||
* `arch/x86/boot/compressed/head_64.S`;
|
||||
|
||||
But the 32-bit mode entry point the the second file in our case. The first file even not compiled for `x86_64`. Let's look on the [arch/x86/boot/compressed/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/Makefile):
|
||||
|
||||
```
|
||||
vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
We can see here that `head_*` depends on the `$(BITS)` variable which depends on the architecture. You can find it in the [arch/x86/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/Makefile):
|
||||
|
||||
```
|
||||
ifeq ($(CONFIG_X86_32),y)
|
||||
...
|
||||
BITS := 32
|
||||
else
|
||||
BITS := 64
|
||||
...
|
||||
endif
|
||||
```
|
||||
|
||||
Now as we jumped on the `startup_32` from the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) we will not find anything related to the interrupt handling here. The `startup_32` contains code that makes preparations before transition into the [long mode](http://en.wikipedia.org/wiki/Long_mode) and directly jumps in it. The `long mode` entry located `startup_64` and it makes preparation before the [kernel decompression](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) that occurs in the `decompress_kernel` from the [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c). After kernel decompressed, we jump on the `startup_64` from the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S). In the `startup_64` we start to build identity-mapped pages. After we have built identity-mapped pages, checked [NX](http://en.wikipedia.org/wiki/NX_bit) bit, made setup of the `Extended Feature Enable Register` (see in links), updated early `Global Descriptor Table` wit the `lgdt` instruction, we need to setup `gs` register with the following code:
|
||||
|
||||
```assembly
|
||||
movl $MSR_GS_BASE,%ecx
|
||||
movl initial_gs(%rip),%eax
|
||||
movl initial_gs+4(%rip),%edx
|
||||
wrmsr
|
||||
```
|
||||
|
||||
We already saw this code in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) and not time to know better what is going on here. First of all pay attention on the last `wrmsr` instruction. This instruction writes data from the `edx:eax` registers to the [model specific register](http://en.wikipedia.org/wiki/Model-specific_register) specified by the `ecx` register. We can see that `ecx` contains `$MSR_GS_BASE` which declared in the [arch/x86/include/uapi/asm/msr-index.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/msr-index.h) and looks:
|
||||
|
||||
```C
|
||||
#define MSR_GS_BASE 0xc0000101
|
||||
```
|
||||
|
||||
From this we can understand that `MSR_GS_BASE` defines number of the `model specific register`. Since registers `cs`, `ds`, `es`, and `ss` are not used in the 64-bit mode, their fields are ignored. But we can access memory over `fs` and `gs` registers. The model specific register provides `back door` to the hidden parts of these segment registers and allows to use 64-bit base address for segment register addressed by the `fs` and `gs`. So the `MSR_GS_BASE` is the hidden part and this part is mapped on the `GS.base` field. Let's look on the `initial_gs`:
|
||||
|
||||
```assembly
|
||||
GLOBAL(initial_gs)
|
||||
.quad INIT_PER_CPU_VAR(irq_stack_union)
|
||||
```
|
||||
|
||||
We pass `irq_stack_union` symbol to the `INIT_PER_CPU_VAR` macro which just concatenates `init_per_cpu__` prefix with the given symbol. In our case we will get `init_per_cpu__irq_stack_union` symbol. Let's look on the [linker](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S) script. There we can see following definition:
|
||||
|
||||
```
|
||||
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
|
||||
INIT_PER_CPU(irq_stack_union);
|
||||
```
|
||||
|
||||
It tells us that address of the `init_per_cpu__irq_stack_union` will be `irq_stack_union + __per_cpu_load`. Now we need to understand where are `init_per_cpu__irq_stack_union` and `__per_cpu_load` and what they mean. The first `irq_stack_union` defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h) with the `DECLARE_INIT_PER_CPU` macro which expands to call of the `init_per_cpu_var` macro:
|
||||
|
||||
```C
|
||||
DECLARE_INIT_PER_CPU(irq_stack_union);
|
||||
|
||||
#define DECLARE_INIT_PER_CPU(var) \
|
||||
extern typeof(per_cpu_var(var)) init_per_cpu_var(var)
|
||||
|
||||
#define init_per_cpu_var(var) init_per_cpu__##var
|
||||
```
|
||||
|
||||
If we will expand all macro we will get the same `init_per_cpu__irq_stack_union` as we got after expanding of the `INIT_PER_CPU` macro, but you can note that it is already not just symbol, but variable. Let's look on the `typeof(percpu_var(var))` expression. Our `var` is `irq_stack_union` and `per_cpu_var` macro defined in the [arch/x86/include/asm/percpu.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/percpu.h):
|
||||
|
||||
```C
|
||||
#define PER_CPU_VAR(var) %__percpu_seg:var
|
||||
```
|
||||
|
||||
where:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_X86_64
|
||||
#define __percpu_seg gs
|
||||
endif
|
||||
```
|
||||
|
||||
So, we accessing `gs:irq_stack_union` and geting its type which is `irq_union`. Ok, we defined the first variable and know its address, now let's look on the second `__per_cpu_load` symbol. There are a couple of percpu variables which are located after this symbol. The `__per_cpu_load` defined in the [include/asm-generic/sections.h](https://github.com/torvalds/linux/blob/master/include/asm-generic-sections.h):
|
||||
|
||||
```C
|
||||
extern char __per_cpu_load[], __per_cpu_start[], __per_cpu_end[];
|
||||
```
|
||||
|
||||
and presented base address of the `per-cpu` variables from the data area. So, we know address of the `irq_stack_union`, `__per_cpu_load` and we know that `init_per_cpu__irq_stack_union` must be placed right after `__per_cpu_load`. And we can see it in the [System.map](http://en.wikipedia.org/wiki/System.map):
|
||||
|
||||
```
|
||||
...
|
||||
...
|
||||
...
|
||||
ffffffff819ed000 D __init_begin
|
||||
ffffffff819ed000 D __per_cpu_load
|
||||
ffffffff819ed000 A init_per_cpu__irq_stack_union
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
Now we know about `initia_gs`, so let's book to the our code:
|
||||
|
||||
```assembly
|
||||
movl $MSR_GS_BASE,%ecx
|
||||
movl initial_gs(%rip),%eax
|
||||
movl initial_gs+4(%rip),%edx
|
||||
wrmsr
|
||||
```
|
||||
|
||||
Here we specified model specific register with `MSR_GS_BASE`, put 64-bit address of the `initial_gs` to the `edx:eax` pair and execute `wrmsr` instruction for filling the `gs` register with base address of the `init_per_cpu__irq_stack_union` which will be bottom of the interrupt stack. After this we will jump to the C code on the `x86_64_start_kernel` from the [arch/x86/kernel/head64.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head64.c). In the `x86_64_start_kernel` function we do the last preparations before we jump into the generic and architecture-independent kernel code and on of these preparations is filling of the early `Interrupt Descriptor Table` with the interrupts handlres entries or `early_idt_handlers`. You can remember it, if you have read the part about the [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) and can remember following code:
|
||||
|
||||
```C
|
||||
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
|
||||
set_intr_gate(i, early_idt_handlers[i]);
|
||||
|
||||
load_idt((const struct desc_ptr *)&idt_descr);
|
||||
```
|
||||
|
||||
but I wrote `Early interrupt and exception handling` part when Linux kernel version was - `3.18`. For this day actual version of the Linux kernel is `4.1.0-rc6+` and ` Andy Lutomirski` sent the [patch](https://lkml.org/lkml/2015/6/2/106) and soon it will be in the mainline kernel that changes behaviour for the `early_idt_handlers`. **NOTE** While I wrote this part the [patch](https://github.com/torvalds/linux/commit/425be5679fd292a3c36cb1fe423086708a99f11a) already turned in the Linux kernel source code. Let's look on it. Now the same part looks like:
|
||||
|
||||
```C
|
||||
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
|
||||
set_intr_gate(i, early_idt_handler_array[i]);
|
||||
|
||||
load_idt((const struct desc_ptr *)&idt_descr);
|
||||
```
|
||||
|
||||
AS you can see it has only one difference in the name of the array of the interrupts handlers entry points. Now it is `early_idt_handler_arry`:
|
||||
|
||||
```C
|
||||
extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE];
|
||||
```
|
||||
|
||||
where `NUM_EXCEPTION_VECTORS` and `EARLY_IDT_HANDLER_SIZE` are defined as:
|
||||
|
||||
```C
|
||||
#define NUM_EXCEPTION_VECTORS 32
|
||||
#define EARLY_IDT_HANDLER_SIZE 9
|
||||
```
|
||||
|
||||
So, the `early_idt_handler_array` is an array of the interrupts handlers entry points and contains one entry point on every nine bytes. You can remember that previous `early_idt_handlers` was defined in the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S). The `early_idt_handler_array` is defined in the same source code file too:
|
||||
|
||||
```assembly
|
||||
ENTRY(early_idt_handler_array)
|
||||
...
|
||||
...
|
||||
...
|
||||
ENDPROC(early_idt_handler_common)
|
||||
```
|
||||
|
||||
It fills `early_idt_handler_arry` with the `.rept NUM_EXCEPTION_VECTORS` and contains entry of the `early_make_pgtable` interrupt handler (more about its implementation you can read in the part about [Early interrupt and exception handling](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html)). For now we come to the end of the `x86_64` architecture-specific code and the next part is the generic kernel code. Of course you already can know that we will return to the architecture-specific code in the `setup_arch` function and other places, but this is the end of the `x86_64` early code.
|
||||
|
||||
Setting stack canary for the interrupt stack
|
||||
-------------------------------------------------------------------------------
|
||||
|
||||
The next stop after the [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S) is the biggest `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c). If you've read the previous [chapter](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html) about the Linux kernel initialization process, you must remember it. This function does all initialization stuff before kernel will launch first `init` process with the [pid](https://en.wikipedia.org/wiki/Process_identifier) - `1`. The first thing that is related to the interrupts and exceptions handling is the call of the `boot_init_stack_canary` function.
|
||||
|
||||
This function sets the [canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries) value to protect interrupt stack overflow. We already saw a little some details about implementation of the `boot_init_stack_canary` in the previous part and now let's take a closer look on it. You can find implementation of this function in the [arch/x86/include/asm/stackprotector.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/stackprotector.h) and its depends on the `CONFIG_CC_STACKPROTECTOR` kernel configuration option. If this option is not set this function will not do anything:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_CC_STACKPROTECTOR
|
||||
...
|
||||
...
|
||||
...
|
||||
#else
|
||||
static inline void boot_init_stack_canary(void)
|
||||
{
|
||||
}
|
||||
#endif
|
||||
```
|
||||
|
||||
If the `CONFIG_CC_STACKPROTECTOR` kernel configuration option is set, the `boot_init_stack_canary` function starts from the check stat `irq_stack_union` that represents [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) interrupt stack has offset equal to forty bytes from the `stack_canary` value:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_X86_64
|
||||
BUILD_BUG_ON(offsetof(union irq_stack_union, stack_canary) != 40);
|
||||
#endif
|
||||
```
|
||||
|
||||
As we can read in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) the `irq_stack_union` represented by the following union:
|
||||
|
||||
```C
|
||||
union irq_stack_union {
|
||||
char irq_stack[IRQ_STACK_SIZE];
|
||||
|
||||
struct {
|
||||
char gs_base[40];
|
||||
unsigned long stack_canary;
|
||||
};
|
||||
};
|
||||
```
|
||||
|
||||
which defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h). We know that [unioun](http://en.wikipedia.org/wiki/Union_type) in the [C](http://en.wikipedia.org/wiki/C_%28programming_language%29) programming language is a data structure which stores only one field in a memory. We can see here that structure has first field - `gs_base` which is 40 bytes size and represents bottom of the `irq_stack`. So, after this our check with the `BUILD_BUG_ON` macro should end successfully. (you can read the first part about Linux kernel initialization [process](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) if you're interesting about the `BUILD_BUG_ON` macro).
|
||||
|
||||
After this we calculate new `canary` value based on the random number and [Time Stamp Counter](http://en.wikipedia.org/wiki/Time_Stamp_Counter):
|
||||
|
||||
```C
|
||||
get_random_bytes(&canary, sizeof(canary));
|
||||
tsc = __native_read_tsc();
|
||||
canary += tsc + (tsc << 32UL);
|
||||
```
|
||||
|
||||
and write `canary` value to the `irq_stack_union` with the `this_cpu_write` macro:
|
||||
|
||||
```C
|
||||
this_cpu_write(irq_stack_union.stack_canary, canary);
|
||||
```
|
||||
|
||||
more about `this_cpu_*` operation you can read in the [Linux kernel documentation](https://github.com/torvalds/linux/blob/master/Documentation/this_cpu_ops.txt).
|
||||
|
||||
Disabling/Enabling local interrupts
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The next step in the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) which is related to the interrupts and interrupts handling after we have set the `canary` value to the interrupt stack - is the call of the `local_irq_disable` macro.
|
||||
|
||||
This macro defined in the [include/linux/irqflags.h](https://github.com/torvalds/linux/blob/master/include/linux/irqflags.h) header file and as you can understand, we can disable interrupts for the CPU with the call of this macro. Let's look on its implementation. First of all note that it depends on the `CONFIG_TRACE_IRQFLAGS_SUPPORT` kernel configuration option:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
|
||||
...
|
||||
#define local_irq_disable() \
|
||||
do { raw_local_irq_disable(); trace_hardirqs_off(); } while (0)
|
||||
...
|
||||
#else
|
||||
...
|
||||
#define local_irq_disable() do { raw_local_irq_disable(); } while (0)
|
||||
...
|
||||
#endif
|
||||
```
|
||||
|
||||
They are both similar and as you can see have only one difference: the `local_irq_disable` macro contains call of the `trace_hardirqs_off` when `CONFIG_TRACE_IRQFLAGS_SUPPORT` is enabled. There is special feature in the [lockdep](http://lwn.net/Articles/321663/) subsystem - `irq-flags tracing` for tracing `hardirq` and `stoftirq` state. In ourcase `lockdep` subsytem can give us interesting information about hard/soft irqs on/off events which are occurs in the system. The `trace_hardirqs_off` function defined in the [kernel/locking/lockdep.c](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep.c):
|
||||
|
||||
```C
|
||||
void trace_hardirqs_off(void)
|
||||
{
|
||||
trace_hardirqs_off_caller(CALLER_ADDR0);
|
||||
}
|
||||
EXPORT_SYMBOL(trace_hardirqs_off);
|
||||
```
|
||||
|
||||
and just calls `trace_hardirqs_off_caller` function. The `trace_hardirqs_off_caller` checks the `hardirqs_enabled` filed of the current process increment the `redundant_hardirqs_off` if call of the `local_irq_disable` was redundant or the `hardirqs_off_events` if it was not. These two fields and other `lockdep` statistic related fields are defined in the [kernel/locking/lockdep_internals.h](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep_internals.h) and located in the `lockdep_stats` structure:
|
||||
|
||||
```C
|
||||
struct lockdep_stats {
|
||||
...
|
||||
...
|
||||
...
|
||||
int softirqs_off_events;
|
||||
int redundant_softirqs_off;
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
If you will set `CONFIG_DEBUG_LOCKDEP` kernel configuration option, the `lockdep_stats_debug_show` function will write all tracing information to the `/proc/lockdep`:
|
||||
|
||||
```C
|
||||
static void lockdep_stats_debug_show(struct seq_file *m)
|
||||
{
|
||||
#ifdef CONFIG_DEBUG_LOCKDEP
|
||||
unsigned long long hi1 = debug_atomic_read(hardirqs_on_events),
|
||||
hi2 = debug_atomic_read(hardirqs_off_events),
|
||||
hr1 = debug_atomic_read(redundant_hardirqs_on),
|
||||
...
|
||||
...
|
||||
...
|
||||
seq_printf(m, " hardirq on events: %11llu\n", hi1);
|
||||
seq_printf(m, " hardirq off events: %11llu\n", hi2);
|
||||
seq_printf(m, " redundant hardirq ons: %11llu\n", hr1);
|
||||
#endif
|
||||
}
|
||||
```
|
||||
|
||||
and you can see its result with the:
|
||||
|
||||
```
|
||||
$ sudo cat /proc/lockdep
|
||||
hardirq on events: 12838248974
|
||||
hardirq off events: 12838248979
|
||||
redundant hardirq ons: 67792
|
||||
redundant hardirq offs: 3836339146
|
||||
softirq on events: 38002159
|
||||
softirq off events: 38002187
|
||||
redundant softirq ons: 0
|
||||
redundant softirq offs: 0
|
||||
```
|
||||
|
||||
Ok, now we know a little about tracing, but more info will be in the separate part about `lockdep` and `tracing`. You can see that the both `local_disable_irq` macros have the same part - `raw_local_irq_disable`. This macro defined in the [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h) and expands to the call of the:
|
||||
|
||||
```C
|
||||
static inline void native_irq_disable(void)
|
||||
{
|
||||
asm volatile("cli": : :"memory");
|
||||
}
|
||||
```
|
||||
|
||||
And you already must remember that `cli` instruction clears the [IF](http://en.wikipedia.org/wiki/Interrupt_flag) flag which determines ability of a processor to handle and interrupt or an exception. Besides the `local_irq_disable`, as you already can know there is an inverse macr - `local_irq_enable`. This macro has the same tracing mechanism and very similar on the `local_irq_enable`, but as you can understand from its name, it enables interrupts with the `sti` instruction:
|
||||
|
||||
```C
|
||||
static inline void native_irq_enable(void)
|
||||
{
|
||||
asm volatile("sti": : :"memory");
|
||||
}
|
||||
```
|
||||
|
||||
Now we know how `local_irq_disable` and `local_irq_enable` work. It was the first call of the `local_irq_disable` macro, but we will meet these macros many times in the Linux kernel source code. But for now we are in the `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) and we just disabled `local` interrupts. Why local and why we did it? Previously kernel provided a method to disable interrupts on all processors and it was called `cli`. This function was [removed](https://lwn.net/Articles/291956/) and now we have `local_irq_{enabled,disable}` to disable or enable interrupts on the current processor. After we've disabled the interrupts with the `local_irq_disable` macro, we set the:
|
||||
|
||||
```C
|
||||
early_boot_irqs_disabled = true;
|
||||
```
|
||||
|
||||
The `early_boot_irqs_disabled` variable defined in the [include/linux/kernel.h](https://github.com/torvalds/linux/blob/master/include/linux/kernel.h):
|
||||
|
||||
```C
|
||||
extern bool early_boot_irqs_disabled;
|
||||
```
|
||||
|
||||
and used in the different places. For example it used in the `smp_call_function_many` function from the [kernel/smp.c](https://github.com/torvalds/linux/blob/master/kernel/smp.c) for the checking possible deadlock when interrupts are disabled:
|
||||
|
||||
```C
|
||||
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
|
||||
&& !oops_in_progress && !early_boot_irqs_disabled);
|
||||
```
|
||||
|
||||
Early trap initialization during kernel initialization
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The next functions after the `local_disable_irq` are `boot_cpu_init` and `page_address_init`, but they are not related to the interrupts and exceptions (more about this functions you can read in the chapter about Linux kernel [initialization process](http://0xax.gitbooks.io/linux-insides/content/Initialization/index.html)). The next is the `setup_arch` function. As you can remember this function located in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel.setup.c) source code file and makes initialization of many different architecture-dependent [stuff](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html). The first interrupts related function which we can see in the `setup_arch` is the - `early_trap_init` function. This function defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) and fills `Interrupt Descriptor Table` with the couple of entries:
|
||||
|
||||
```C
|
||||
void __init early_trap_init(void)
|
||||
{
|
||||
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
|
||||
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
|
||||
#ifdef CONFIG_X86_32
|
||||
set_intr_gate(X86_TRAP_PF, page_fault);
|
||||
#endif
|
||||
load_idt(&idt_descr);
|
||||
}
|
||||
```
|
||||
|
||||
Here we can see calls of three different functions:
|
||||
|
||||
* `set_intr_gate_ist`
|
||||
* `set_system_intr_gate_ist`
|
||||
* `set_intr_gate`
|
||||
|
||||
All of these functions defined in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) and do the similar thing but not the same. The first `set_intr_gate_ist` function inserts new an interrupt gate in the `IDT`. Let's look on its implementation:
|
||||
|
||||
```C
|
||||
static inline void set_intr_gate_ist(int n, void *addr, unsigned ist)
|
||||
{
|
||||
BUG_ON((unsigned)n > 0xFF);
|
||||
_set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS);
|
||||
}
|
||||
```
|
||||
|
||||
First of all we can see the check that `n` which is [vector number](http://en.wikipedia.org/wiki/Interrupt_vector_table) of the interrupt is not greater than `0xff` or 255. We need to check it because we remember from the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html) that vector number of an interrupt must be between `0` and `255`. In the next step we can see the call of the `_set_gate` function that sets a given interrupt gate to the `IDT` table:
|
||||
|
||||
```C
|
||||
static inline void _set_gate(int gate, unsigned type, void *addr,
|
||||
unsigned dpl, unsigned ist, unsigned seg)
|
||||
{
|
||||
gate_desc s;
|
||||
|
||||
pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
|
||||
write_idt_entry(idt_table, gate, &s);
|
||||
write_trace_idt_entry(gate, &s);
|
||||
}
|
||||
```
|
||||
|
||||
Here we start from the `pack_gate` function which takes clean `IDT` entry represented by the `gate_desc` structure and fills it with the base address and limit, [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks), [Privilege level](http://en.wikipedia.org/wiki/Privilege_level), type of an interrupt which can be one of the following values:
|
||||
|
||||
* `GATE_INTERRUPT`
|
||||
* `GATE_TRAP`
|
||||
* `GATE_CALL`
|
||||
* `GATE_TASK`
|
||||
|
||||
and set the present bit for the given `IDT` entry:
|
||||
|
||||
```C
|
||||
static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
|
||||
unsigned dpl, unsigned ist, unsigned seg)
|
||||
{
|
||||
gate->offset_low = PTR_LOW(func);
|
||||
gate->segment = __KERNEL_CS;
|
||||
gate->ist = ist;
|
||||
gate->p = 1;
|
||||
gate->dpl = dpl;
|
||||
gate->zero0 = 0;
|
||||
gate->zero1 = 0;
|
||||
gate->type = type;
|
||||
gate->offset_middle = PTR_MIDDLE(func);
|
||||
gate->offset_high = PTR_HIGH(func);
|
||||
}
|
||||
```
|
||||
|
||||
After this we write just filled interrupt gate to the `IDT` with the `write_idt_entry` macro which expands to the `native_write_idt_entry` and just copy the interrupt gate to the `idt_table` table by the given index:
|
||||
|
||||
```C
|
||||
#define write_idt_entry(dt, entry, g) native_write_idt_entry(dt, entry, g)
|
||||
|
||||
static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
|
||||
{
|
||||
memcpy(&idt[entry], gate, sizeof(*gate));
|
||||
}
|
||||
```
|
||||
|
||||
where `idt_table` is just array of `gate_desc`:
|
||||
|
||||
```C
|
||||
extern gate_desc idt_table[];
|
||||
```
|
||||
|
||||
That's all. The second `set_system_intr_gate_ist` function has only one difference from the `set_intr_gate_ist`:
|
||||
|
||||
```C
|
||||
static inline void set_system_intr_gate_ist(int n, void *addr, unsigned ist)
|
||||
{
|
||||
BUG_ON((unsigned)n > 0xFF);
|
||||
_set_gate(n, GATE_INTERRUPT, addr, 0x3, ist, __KERNEL_CS);
|
||||
}
|
||||
```
|
||||
|
||||
Do you see it? Look on the fourth parameter of the `_set_gate`. It is `0x3`. In the `set_intr_gate` it was `0x0`. We know that this parameter represent `DPL` or privilege level. We also know that `0` is the highest privilge level and `3` is the lowest.Now we know how `set_system_intr_gate_ist`, `set_intr_gate_ist`, `set_intr_gate` are work and we can return to the `early_trap_init` function. Let's look on it again:
|
||||
|
||||
```C
|
||||
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
|
||||
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
|
||||
```
|
||||
|
||||
We set two `IDT` entries for the `#DB` interrupt and `int3`. These functions takes the same set of parameters:
|
||||
|
||||
* vector number of an interrupt;
|
||||
* address of an interrupt handler;
|
||||
* interrupt stack table index.
|
||||
|
||||
That's all. More about interrupts and handlers you will know in the next parts.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
It is the end of the second part about interrupts and interrupt handling in the Linux kernel. We saw the some theory in the previous part and started to dive into interrupts and exceptions handling in the current part. We have started from the earliest parts in the Linux kernel source code which are related to the interrupts. In the next part we will continue to dive into this interesting theme and will know more about interrupt handling process.
|
||||
|
||||
If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
||||
|
||||
**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [IDT](http://en.wikipedia.org/wiki/Interrupt_descriptor_table)
|
||||
* [Protected mode](http://en.wikipedia.org/wiki/Protected_mode)
|
||||
* [List of x86 calling conventions](http://en.wikipedia.org/wiki/X86_calling_conventions#List_of_x86_calling_conventions)
|
||||
* [8086](http://en.wikipedia.org/wiki/Intel_8086)
|
||||
* [Long mode](http://en.wikipedia.org/wiki/Long_mode)
|
||||
* [NX](http://en.wikipedia.org/wiki/NX_bit)
|
||||
* [Extended Feature Enable Register](http://en.wikipedia.org/wiki/Control_register#Additional_Control_registers_in_x86-64_series)
|
||||
* [Model-specific register](http://en.wikipedia.org/wiki/Model-specific_register)
|
||||
* [Process identifier](https://en.wikipedia.org/wiki/Process_identifier)
|
||||
* [lockdep](http://lwn.net/Articles/321663/)
|
||||
* [irqflags tracing](https://www.kernel.org/doc/Documentation/irqflags-tracing.txt)
|
||||
* [IF](http://en.wikipedia.org/wiki/Interrupt_flag)
|
||||
* [Stack canary](http://en.wikipedia.org/wiki/Stack_buffer_overflow#Stack_canaries)
|
||||
* [Union type](http://en.wikipedia.org/wiki/Union_type)
|
||||
* [this_cpu_* operations](https://github.com/torvalds/linux/blob/master/Documentation/this_cpu_ops.txt)
|
||||
* [vector number](http://en.wikipedia.org/wiki/Interrupt_vector_table)
|
||||
* [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks)
|
||||
* [Privilege level](http://en.wikipedia.org/wiki/Privilege_level)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-1.html)
|
@ -0,0 +1,470 @@
|
||||
Interrupts and Interrupt Handling. Part 3.
|
||||
================================================================================
|
||||
|
||||
Interrupt handlers
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the third part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about an interrupts and an exceptions handling and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we stoped in the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blame/master/arch/x86/kernel/setup.c) on the setting of the two exceptions handlers for the two following exceptions:
|
||||
|
||||
* `#DB` - debug exception, transfers control from the interrupted process to the debug handler;
|
||||
* `#BP` - breakpoint exception, caused by the `int 3` instruction.
|
||||
|
||||
These exceptions allow the `x86_64` architecture to have early exception processing for the purpose of debugging via the [kgdb](https://en.wikipedia.org/wiki/KGDB).
|
||||
|
||||
As you can remember we set these exceptions handlers in the `early_trap_init` function:
|
||||
|
||||
```C
|
||||
void __init early_trap_init(void)
|
||||
{
|
||||
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
|
||||
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
|
||||
load_idt(&idt_descr);
|
||||
}
|
||||
```
|
||||
|
||||
from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We already saw implementation of the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions in the previous part and now we will look on the implementation of these early exceptions handlers.
|
||||
|
||||
Debug and Breakpoint exceptions
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Ok, we set the interrupts gates in the `early_trap_init` function for the `#DB` and `#BP` exceptions and now time is to look on their handlers. But first of all let's look on these exceptions. The first exceptions - `#DB` or debug exception occurs when a debug event occurs, for example attempt to change the contents of a [debug register](http://en.wikipedia.org/wiki/X86_debug_register). Debug registers are special registers which present in processors starting from the [Intel 80386](http://en.wikipedia.org/wiki/Intel_80386) and as you can understand from its name they are used for debugging. These registers allow to set breakpoints on the code and read or write data to trace, thus tracking the place of errors. The debug registers are privileged resources available and the program in either real-address or protected mode at `CPL` is `0`, that's why we have used `set_intr_gate_ist` for the `#DB`, but not the `set_system_intr_gate_ist`. The verctor number of the `#DB` exceptions is `1` (we pass it as `X86_TRAP_DB`) and has no error code:
|
||||
|
||||
```
|
||||
----------------------------------------------------------------------------------------------
|
||||
|Vector|Mnemonic|Description |Type |Error Code|Source |
|
||||
----------------------------------------------------------------------------------------------
|
||||
|1 | #DB |Reserved |F/T |NO | |
|
||||
----------------------------------------------------------------------------------------------
|
||||
```
|
||||
|
||||
The second is `#BP` or breakpoint exception occurs when processor executes the [INT 3](http://en.wikipedia.org/wiki/INT_%28x86_instruction%29#INT_3) instruction. We can add it anywhere in our code, for example let's look on the simple program:
|
||||
|
||||
```C
|
||||
// breakpoint.c
|
||||
#include <stdio.h>
|
||||
|
||||
int main() {
|
||||
int i;
|
||||
while (i < 6){
|
||||
printf("i equal to: %d\n", i);
|
||||
__asm__("int3");
|
||||
++i;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
If we will compile and run this program, we will see following output:
|
||||
|
||||
```
|
||||
$ gcc breakpoint.c -o breakpoint
|
||||
i equal to: 0
|
||||
Trace/breakpoint trap
|
||||
```
|
||||
|
||||
But if will run it with gdb, we will see our breakpoint and can continue execution of our program:
|
||||
|
||||
```
|
||||
$ gdb breakpoint
|
||||
...
|
||||
...
|
||||
...
|
||||
(gdb) run
|
||||
Starting program: /home/alex/breakpoints
|
||||
i equal to: 0
|
||||
|
||||
Program received signal SIGTRAP, Trace/breakpoint trap.
|
||||
0x0000000000400585 in main ()
|
||||
=> 0x0000000000400585 <main+31>: 83 45 fc 01 add DWORD PTR [rbp-0x4],0x1
|
||||
(gdb) c
|
||||
Continuing.
|
||||
i equal to: 1
|
||||
|
||||
Program received signal SIGTRAP, Trace/breakpoint trap.
|
||||
0x0000000000400585 in main ()
|
||||
=> 0x0000000000400585 <main+31>: 83 45 fc 01 add DWORD PTR [rbp-0x4],0x1
|
||||
(gdb) c
|
||||
Continuing.
|
||||
i equal to: 2
|
||||
|
||||
Program received signal SIGTRAP, Trace/breakpoint trap.
|
||||
0x0000000000400585 in main ()
|
||||
=> 0x0000000000400585 <main+31>: 83 45 fc 01 add DWORD PTR [rbp-0x4],0x1
|
||||
...
|
||||
...
|
||||
...
|
||||
```
|
||||
|
||||
Now we know a little about these two exceptions and we can move on to consideration of their handlers.
|
||||
|
||||
Preparation before an interrupt handler
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
As you can note, the `set_intr_gate_ist` and `set_system_intr_gate_ist` functions takes an addresses of the exceptions handlers in the second parameter:
|
||||
|
||||
* `&debug`;
|
||||
* `&int3`.
|
||||
|
||||
You will not find these functions in the C code. All that can be found in in the `*.c/*.h` files only definition of this functions in the [arch/x86/include/asm/traps.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/traps.h):
|
||||
|
||||
```C
|
||||
asmlinkage void debug(void);
|
||||
asmlinkage void int3(void);
|
||||
```
|
||||
|
||||
But we can see `asmlinkage` descriptor here. The `asmlinkage` is the special specificator of the [gcc](http://en.wikipedia.org/wiki/GNU_Compiler_Collection). Actually for a `C` functions which are will be called from assembly, we need in explicit declaration of the function calling convention. In our case, if function maked with `asmlinkage` descriptor, then `gcc` will compile the function to retrieve parameters from stack. So, both handlers are defined in the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly source code file with the `idtentry` macro:
|
||||
|
||||
```assembly
|
||||
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
|
||||
idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
|
||||
```
|
||||
|
||||
Actually `debug` and `int3` are not interrupts handlers. Remember that before we can execute an interrupt/exception handler, we need to do some preparations as:
|
||||
|
||||
* When an interrupt or exception occured, the processor uses an exception or interrupt vector as an index to a descriptor in the `IDT`;
|
||||
* In legacy mode `ss:esp` registers are pushed on the stack only if privilege level changed. In 64-bit mode `ss:rsp` pushed on the stack everytime;
|
||||
* During stack switching with `IST` the new `ss` selector is forced to null. Old `ss` and `rsp` are pushed on the new stack.
|
||||
* The `rflags`, `cs`, `rip` and error code pushed on the stack;
|
||||
* Control transfered to an interrupt handler;
|
||||
* After an interrupt handler will finish its work and finishes with the `iret` instruction, old `ss` will be poped from the stack and loaded to the `ss` register.
|
||||
* `ss:rsp` will be popped from the stack unconditionally in the 64-bit mode and will be popped only if there is a privilege level change in legacy mode.
|
||||
* `iret` instruction will restore `rip`, `cs` and `rflags`;
|
||||
* Interrupted program will continue its execution.
|
||||
|
||||
```
|
||||
+--------------------+
|
||||
+40 | ss |
|
||||
+32 | rsp |
|
||||
+24 | rflags |
|
||||
+16 | cs |
|
||||
+8 | rip |
|
||||
0 | error code |
|
||||
+--------------------+
|
||||
```
|
||||
|
||||
Now we can see on the preparations before a process will transfer control to an interrupt/exception handler from practical side. As I already wrote above the first thirteen exceptions handlers defined in the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly file with the [idtentry](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S#L967) macro:
|
||||
|
||||
```assembly
|
||||
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
|
||||
ENTRY(\sym)
|
||||
...
|
||||
...
|
||||
...
|
||||
END(\sym)
|
||||
.endm
|
||||
```
|
||||
|
||||
This macro defines an exception entry point and as we can see it takes `five` arguments:
|
||||
|
||||
* `sym` - defines global symbol with the `.globl name`.
|
||||
* `do_sym` - an interrupt handler.
|
||||
* `has_error_code:req` - information about error code, The `:req` qualifier tells the assembler that the argument is required;
|
||||
* `paranoid` - shows us how we need to check current mode;
|
||||
* `shift_ist` - shows us what's stack to use;
|
||||
|
||||
As we can see our exceptions handlers are almost the same:
|
||||
|
||||
```assembly
|
||||
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
|
||||
idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
|
||||
```
|
||||
|
||||
The differences are only in the global name and name of exceptions handlers. Now let's look how `idtentry` macro implemented. It starts from the two checks:
|
||||
|
||||
```assembly
|
||||
.if \shift_ist != -1 && \paranoid == 0
|
||||
.error "using shift_ist requires paranoid=1"
|
||||
.endif
|
||||
|
||||
.if \has_error_code
|
||||
XCPT_FRAME
|
||||
.else
|
||||
INTR_FRAME
|
||||
.endif
|
||||
```
|
||||
|
||||
First check makes the check that an exceptions uses `Interrupt stack table` and `paranoid` is set, in other way it emits the erorr with the [.error](https://sourceware.org/binutils/docs/as/Error.html#Error) directive. The second `if` clause checks existence of an error code and calls `XCPT_FRAME` or `INTR_FRAME` macros depends on it. These macros just expand to the set of [CFI directives](https://sourceware.org/binutils/docs/as/CFI-directives.html) which are used by `GNU AS` to manage call frames. The `CFI` directives are used only to generate [dwarf2](http://en.wikipedia.org/wiki/DWARF) unwind information for better backtraces and they don't change any code, so we will not go into detail about it and from this point I will skip all code which is related to these directives. In the next step we check error code again and push it on the stack if an exception has it with the:
|
||||
|
||||
```assembly
|
||||
.ifeq \has_error_code
|
||||
pushq_cfi $-1
|
||||
.endif
|
||||
```
|
||||
|
||||
The `pushq_cfi` macro defined in the [arch/x86/include/asm/dwarf2.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/dwarf2.h) and expands to the `pushq` instruction which pushes given error code:
|
||||
|
||||
```assembly
|
||||
.macro pushq_cfi reg
|
||||
pushq \reg
|
||||
CFI_ADJUST_CFA_OFFSET 8
|
||||
.endm
|
||||
```
|
||||
|
||||
Pay attention on the `$-1`. We already know that when an exception occrus, the processor pushes `ss`, `rsp`, `rflags`, `cs` and `rip` on the stack:
|
||||
|
||||
```C
|
||||
#define RIP 16*8
|
||||
#define CS 17*8
|
||||
#define EFLAGS 18*8
|
||||
#define RSP 19*8
|
||||
#define SS 20*8
|
||||
```
|
||||
|
||||
With the `pushq \reg` we denote that place before the `RIP` will contain error code of an exception:
|
||||
|
||||
```C
|
||||
#define ORIG_RAX 15*8
|
||||
```
|
||||
|
||||
The `ORIG_RAX` will contain error code of an exception, [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) number on a hardware interrupt and system call number on [system call](http://en.wikipedia.org/wiki/System_call) entry. In the next step we can see thr `ALLOC_PT_GPREGS_ON_STACK` macro which allocates space for the 15 general purpose registers on the stack:
|
||||
|
||||
```assembly
|
||||
.macro ALLOC_PT_GPREGS_ON_STACK addskip=0
|
||||
subq $15*8+\addskip, %rsp
|
||||
CFI_ADJUST_CFA_OFFSET 15*8+\addskip
|
||||
.endm
|
||||
```
|
||||
|
||||
After this we check `paranoid` and if it is set we check first three `CPL` bits. We compare it with the `3` and it allows us to know did we come from userspace or not:
|
||||
|
||||
```assembly
|
||||
.if \paranoid
|
||||
.if \paranoid == 1
|
||||
CFI_REMEMBER_STATE
|
||||
testl $3, CS(%rsp)
|
||||
jnz 1f
|
||||
.endif
|
||||
call paranoid_entry
|
||||
.else
|
||||
call error_entry
|
||||
.endif
|
||||
```
|
||||
|
||||
If we came from userspace we jump on the label `1` which starts from the `call error_entry` instruction. The `error_entry` saves all registers in the `pt_regs` structure which presetens an interrupt/exception stack frame and defined in the [arch/x86/include/uapi/asm/ptrace.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/ptrace.h). It saves common and extra registers on the stack with the:
|
||||
|
||||
```assembly
|
||||
SAVE_C_REGS 8
|
||||
SAVE_EXTRA_REGS 8
|
||||
```
|
||||
|
||||
from `rdi` to `r15` and executes [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html) instruction. This instruction provides a method to for the Linux kernel to obtain a pointer to the kernel data structures and save the user's `gsbase`. After this we will exit from the `error_entry` with the `ret` instruction. After the `error_entry` finished to execute, since we came from userspace we need to switch on kernel interrupt stack:
|
||||
|
||||
```assembly
|
||||
movq %rsp,%rdi
|
||||
call sync_regs
|
||||
```
|
||||
|
||||
We just save all registers to the `error_entry` in the `error_entry`, we put address of the `pt_regs` to the `rdi` and call `sync_regs` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c):
|
||||
|
||||
```C
|
||||
asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
|
||||
{
|
||||
struct pt_regs *regs = task_pt_regs(current);
|
||||
*regs = *eregs;
|
||||
return regs;
|
||||
}
|
||||
```
|
||||
|
||||
This function switchs off the `IST` stack if we came from usermode. After this we switch on the stack which we got from the `sync_regs`:
|
||||
|
||||
```assembly
|
||||
movq %rax,%rsp
|
||||
movq %rsp,%rdi
|
||||
```
|
||||
|
||||
and put pointer of the `pt_regs` again in the `rdi`, and in the last step we call an exception handler:
|
||||
|
||||
```assembly
|
||||
call \do_sym
|
||||
```
|
||||
|
||||
So, realy exceptions handlers are `do_debug` and `do_int3` functions. We will see these function in this part, but little later. First of all let's look on the preparations before a processor will transfer control to an interrupt handler. In another way if `paranoid` is set, but it is not 1, we call `paranoid_entry` which makes almost the same that `error_entry`, but it checks current mode with more slow but accurate way:
|
||||
|
||||
```assembly
|
||||
ENTRY(paranoid_entry)
|
||||
SAVE_C_REGS 8
|
||||
SAVE_EXTRA_REGS 8
|
||||
...
|
||||
...
|
||||
movl $MSR_GS_BASE,%ecx
|
||||
rdmsr
|
||||
testl %edx,%edx
|
||||
js 1f /* negative -> in kernel */
|
||||
SWAPGS
|
||||
...
|
||||
...
|
||||
ret
|
||||
END(paranoid_entry)
|
||||
```
|
||||
|
||||
If `edx` wll be negative, we are in the kernel mode. As we store all registers on the stack, check that we are in the kernel mode, we need to setup `IST` stack if it is set for a given exception, call an exception handler and restore the exception stack:
|
||||
|
||||
```assembly
|
||||
.if \shift_ist != -1
|
||||
subq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
|
||||
.endif
|
||||
|
||||
call \do_sym
|
||||
|
||||
.if \shift_ist != -1
|
||||
addq $EXCEPTION_STKSZ, CPU_TSS_IST(\shift_ist)
|
||||
.endif
|
||||
```
|
||||
|
||||
The last step when an exception handler will finish it's work all registers will be restored from the stack with the `RESTORE_C_REGS` and `RESTORE_EXTRA_REGS` macros and control will be returned an interrupted task. That's all. Now we know about preparation before an interrupt/exception handler will start to execute and we can go directly to the implementation of the handlers.
|
||||
|
||||
Implementation of ainterrupts and exceptions handlers
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
Both handlers `do_debug` and `do_int3` defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c) source code file and have two similar things: All interrupts/exceptions handlers marked with the `dotraplinkage` prefix that expands to the:
|
||||
|
||||
```C
|
||||
#define dotraplinkage __visible
|
||||
#define __visible __attribute__((externally_visible))
|
||||
```
|
||||
|
||||
which tells to compiler that something else uses this function (in our case these functions are called from the assembly interrupt preparation code). And also they takes two parameters:
|
||||
|
||||
* pointer to the `pt_regs` structure which contains registers of the interrupted task;
|
||||
* error code.
|
||||
|
||||
First of all let's consider `do_debug` handler. This function starts from the getting previous state with the `ist_enter` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We call it because we need to know, did we come to the interrupt handler from the kernel mode or user mode.
|
||||
|
||||
```C
|
||||
prev_state = ist_enter(regs);
|
||||
```
|
||||
|
||||
The `ist_enter` function returns previous state context state and executes a couple preprartions before we continue to handle an exception. It starts from the check of the previous mode with the `user_mode_vm` macro. It takes `pt_regs` structure which contains a set of registers of the interrupted task and returns `1` if we came from userspace and `0` if we came from kernel space. According to the previous mode we execute `exception_enter` if we are from the userspace or inform [RCU](https://en.wikipedia.org/wiki/Read-copy-update) if we are from krenel space:
|
||||
|
||||
```C
|
||||
...
|
||||
if (user_mode_vm(regs)) {
|
||||
prev_state = exception_enter();
|
||||
} else {
|
||||
rcu_nmi_enter();
|
||||
prev_state = IN_KERNEL;
|
||||
}
|
||||
...
|
||||
...
|
||||
...
|
||||
return prev_state;
|
||||
```
|
||||
|
||||
After this we load the `DR6` debug registers to the `dr6` variable with the call of the `get_debugreg` macro from the [arch/x86/include/asm/debugreg.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/debugreg.h):
|
||||
|
||||
```C
|
||||
get_debugreg(dr6, 6);
|
||||
dr6 &= ~DR6_RESERVED;
|
||||
```
|
||||
|
||||
The `DR6` debug register is debug status register contains information about the reason for stopping the `#DB` or debug exception handler. After we loaded its value to the `dr6` variable we filter out all reserved bits (`4:12` bits). In the next step we check `dr6` register and previous state with the following `if` condition expression:
|
||||
|
||||
```C
|
||||
if (!dr6 && user_mode_vm(regs))
|
||||
user_icebp = 1;
|
||||
```
|
||||
|
||||
If `dr6` does not show any reasons why we caught this trap we set `user_icebp` to one which means that user-code wants to get [SIGTRAP](https://en.wikipedia.org/wiki/Unix_signal#SIGTRAP) signal. In the next step we check was it [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt) trap and if yes we go to exit:
|
||||
|
||||
```C
|
||||
if ((dr6 & DR_STEP) && kmemcheck_trap(regs))
|
||||
goto exit;
|
||||
```
|
||||
|
||||
After we did all these checks, we clear the `dr6` register, clear the `DEBUGCTLMSR_BTF` flag which provides single-step on branches debugging, set `dr6` register for the current thread and increase `debug_stack_usage` [per-cpu]([Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)) variable with the:
|
||||
|
||||
```C
|
||||
set_debugreg(0, 6);
|
||||
clear_tsk_thread_flag(tsk, TIF_BLOCKSTEP);
|
||||
tsk->thread.debugreg6 = dr6;
|
||||
debug_stack_usage_inc();
|
||||
```
|
||||
|
||||
As we saved `dr6`, we can allow irqs:
|
||||
|
||||
```C
|
||||
static inline void preempt_conditional_sti(struct pt_regs *regs)
|
||||
{
|
||||
preempt_count_inc();
|
||||
if (regs->flags & X86_EFLAGS_IF)
|
||||
local_irq_enable();
|
||||
}
|
||||
```
|
||||
|
||||
more about `local_irq_enabled` and related stuff you can read in the second part about [interrupts handling in the Linux kernel](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-2.html). In the next step we check the previous mode was [virtual 8086](https://en.wikipedia.org/wiki/Virtual_8086_mode) and handle the trap:
|
||||
|
||||
```C
|
||||
if (regs->flags & X86_VM_MASK) {
|
||||
handle_vm86_trap((struct kernel_vm86_regs *) regs, error_code, X86_TRAP_DB);
|
||||
preempt_conditional_cli(regs);
|
||||
debug_stack_usage_dec();
|
||||
goto exit;
|
||||
}
|
||||
...
|
||||
...
|
||||
...
|
||||
exit:
|
||||
ist_exit(regs, prev_state);
|
||||
```
|
||||
|
||||
If we came not from the virtual 8086 mode, we need to check `dr6` register and previous mode as we did it above. Here we check if step mode debugging is
|
||||
enabled and we are not from the user mode, we enabled step mode debugging in the `dr6` copy in the current thread, set `TIF_SINGLE_STEP` falg and re-enable [Trap flag](https://en.wikipedia.org/wiki/Trap_flag) for the user mode:
|
||||
|
||||
```C
|
||||
if ((dr6 & DR_STEP) && !user_mode(regs)) {
|
||||
tsk->thread.debugreg6 &= ~DR_STEP;
|
||||
set_tsk_thread_flag(tsk, TIF_SINGLESTEP);
|
||||
regs->flags &= ~X86_EFLAGS_TF;
|
||||
}
|
||||
```
|
||||
|
||||
Then we get `SIGTRAP` signal code:
|
||||
|
||||
```C
|
||||
si_code = get_si_code(tsk->thread.debugreg6);
|
||||
```
|
||||
|
||||
and send it for user icebp traps:
|
||||
|
||||
```C
|
||||
if (tsk->thread.debugreg6 & (DR_STEP | DR_TRAP_BITS) || user_icebp)
|
||||
send_sigtrap(tsk, regs, error_code, si_code);
|
||||
preempt_conditional_cli(regs);
|
||||
debug_stack_usage_dec();
|
||||
exit:
|
||||
ist_exit(regs, prev_state);
|
||||
```
|
||||
|
||||
In the end we disabled `irqs`, decrement value of the `debug_stack_usage` and exit from the exception handler with the `ist_exit` function.
|
||||
|
||||
The second exception handler is `do_int3` defined in the same source code file - [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). In the `do_int3` we makes almost the same that in the `do_debug` handler. We get the previous state with the `ist_enter`, increment and decrement the `debug_stack_usage` per-cpu variable, enabled and disable local interrupts. But of course there is one difference between these two handlers. We need to lock and than sync processor cores during breakpoint patching.
|
||||
|
||||
That's all.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
It is the end of the third part about interrupts and interrupt handling in the Linux kernel. We saw the initialization of the [Interrupt descriptor table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) in the previous part with the `#DB` and `#BP` gates and started to dive into preparation before control will be transfered to an exception handler and implementation of some interrupt handlers in this part. In the next part we will continue to dive into this theme and will go next by the `setup_arch` function and will try to understand interrupts handling related stuff.
|
||||
|
||||
If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
||||
|
||||
**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [Debug registers](http://en.wikipedia.org/wiki/X86_debug_register)
|
||||
* [Intel 80385](http://en.wikipedia.org/wiki/Intel_80386)
|
||||
* [INT 3](http://en.wikipedia.org/wiki/INT_%28x86_instruction%29#INT_3)
|
||||
* [gcc](http://en.wikipedia.org/wiki/GNU_Compiler_Collection)
|
||||
* [TSS](http://en.wikipedia.org/wiki/Task_state_segment)
|
||||
* [GNU assembly .error directive](https://sourceware.org/binutils/docs/as/Error.html#Error)
|
||||
* [dwarf2](http://en.wikipedia.org/wiki/DWARF)
|
||||
* [CFI directives](https://sourceware.org/binutils/docs/as/CFI-directives.html)
|
||||
* [IRQ](http://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
|
||||
* [system call](http://en.wikipedia.org/wiki/System_call)
|
||||
* [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html)
|
||||
* [SIGTRAP](https://en.wikipedia.org/wiki/Unix_signal#SIGTRAP)
|
||||
* [Per-CPU variables](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
|
||||
* [kgdb](https://en.wikipedia.org/wiki/KGDB)
|
||||
* [ACPI](https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html)
|
@ -0,0 +1,465 @@
|
||||
Interrupts and Interrupt Handling. Part 4.
|
||||
================================================================================
|
||||
|
||||
Initialization of non-early interrupt gates
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is fourth part about an interrupts and exceptions handling in the Linux kernel and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) we saw first early `#DB` and `#BP` exceptions handlers from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). We stopped on the right after the `early_trap_init` function that called in the `setup_arch` function which defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/setup.c). In this part we will continue to dive into an interrupts and exceptions handling in the Linux kernel for `x86_64` and continue to do it from from the place where we left off in the last part. First thing which is related to the interrupts and exceptions handling is the setup of the `#PF` or [page fault](https://en.wikipedia.org/wiki/Page_fault) handler with the `early_trap_pf_init` function. Let's start from it.
|
||||
|
||||
Early page fault handler
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The `early_trap_pf_init` function defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). It uses `set_intr_gate` macro that filles [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table) with the given entry:
|
||||
|
||||
```C
|
||||
void __init early_trap_pf_init(void)
|
||||
{
|
||||
#ifdef CONFIG_X86_64
|
||||
set_intr_gate(X86_TRAP_PF, page_fault);
|
||||
#endif
|
||||
}
|
||||
```
|
||||
|
||||
This macro defined in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/desc.h). We already saw macros like this in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) - `set_system_intr_gate` and `set_intr_gate_ist`. This macro checks that given vector number is not greater than `255` (maximum vector number) and calls `_set_gate` function as `set_system_intr_gate` and `set_intr_gate_ist` did it:
|
||||
|
||||
```C
|
||||
#define set_intr_gate(n, addr) \
|
||||
do { \
|
||||
BUG_ON((unsigned)n > 0xFF); \
|
||||
_set_gate(n, GATE_INTERRUPT, (void *)addr, 0, 0, \
|
||||
__KERNEL_CS); \
|
||||
_trace_set_gate(n, GATE_INTERRUPT, (void *)trace_##addr,\
|
||||
0, 0, __KERNEL_CS); \
|
||||
} while (0)
|
||||
```
|
||||
|
||||
The `set_intr_gate` macro takes two parameters:
|
||||
|
||||
* vector number of a interrupt;
|
||||
* address of an interrupt handler;
|
||||
|
||||
In our case they are:
|
||||
|
||||
* `X86_TRAP_PF` - `14`;
|
||||
* `page_fault` - the interrupt handler entry point.
|
||||
|
||||
The `X86_TRAP_PF` is the element of enum which defined in the [arch/x86/include/asm/traprs.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/traprs.h):
|
||||
|
||||
```C
|
||||
enum {
|
||||
...
|
||||
...
|
||||
...
|
||||
...
|
||||
X86_TRAP_PF, /* 14, Page Fault */
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
When the `early_trap_pf_init` will be called, the `set_intr_gate` will be expanded to the call of the `_set_gate` which will fill the `IDT` with the handler for the page fault. Now let's look on the implementation of the `page_fault` handler. The `page_fault` handler defined in the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly source code file as all exceptions handlers. Let's look on it:
|
||||
|
||||
```assembly
|
||||
trace_idtentry page_fault do_page_fault has_error_code=1
|
||||
```
|
||||
|
||||
We saw in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) how `#DB` and `#BP` handlers defined. They were defined with the `idtentry` macro, but here we can see `trace_idtentry`. This macro defined in the same source code file and depends on the `CONFIG_TRACING` kernel configuration option:
|
||||
|
||||
```assembly
|
||||
#ifdef CONFIG_TRACING
|
||||
.macro trace_idtentry sym do_sym has_error_code:req
|
||||
idtentry trace(\sym) trace(\do_sym) has_error_code=\has_error_code
|
||||
idtentry \sym \do_sym has_error_code=\has_error_code
|
||||
.endm
|
||||
#else
|
||||
.macro trace_idtentry sym do_sym has_error_code:req
|
||||
idtentry \sym \do_sym has_error_code=\has_error_code
|
||||
.endm
|
||||
#endif
|
||||
```
|
||||
|
||||
We will not dive into exceptions [Tracing](https://en.wikipedia.org/wiki/Tracing_%28software%29) now. If `CONFIG_TRACING` is not set, we can see that `trace_idtentry` macro just expands to the normal `idtentry`. We already saw implementation of the `idtentry` macro in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html), so let's start from the `page_fault` exception handler.
|
||||
|
||||
As we can see in the `idtentry` definition, the handler of the `page_fault` is `do_page_fault` function which defined in the [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c) and as all exceptions handlers it takes two arguments:
|
||||
|
||||
* `regs` - `pt_regs` structure that holds state of an interrupted process;
|
||||
* `error_code` - error code of the page fault exception.
|
||||
|
||||
Let's look inside this function. First of all we read content of the [cr2](https://en.wikipedia.org/wiki/Control_register) control register:
|
||||
|
||||
```C
|
||||
dotraplinkage void notrace
|
||||
do_page_fault(struct pt_regs *regs, unsigned long error_code)
|
||||
{
|
||||
unsigned long address = read_cr2();
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
This register contains a linear address which caused `page fault`. In the next step we make a call of the `exception_enter` function from the [include/linux/context_tracking.h](https://github.com/torvalds/linux/blob/master/include/context_tracking.h). The `exception_enter` and `exception_exit` are functions from context tracking subsytem in the Linux kernel used by the [RCU](https://en.wikipedia.org/wiki/Read-copy-update) to remove its dependency on the timer tick while a processor runs in userspace. Almost in the every exception handler we will see similar code:
|
||||
|
||||
```C
|
||||
enum ctx_state prev_state;
|
||||
prev_state = exception_enter();
|
||||
...
|
||||
... // exception handler here
|
||||
...
|
||||
exception_exit(prev_state);
|
||||
```
|
||||
|
||||
The `exception_enter` function checks that `context tracking` is enabled with the `context_tracking_is_enabled` and if it is in enabled state, we get previous context with te `this_cpu_read` (more about `this_cpu_*` operations you can read in the [Documentation](https://github.com/torvalds/linux/blob/master/Documentation/this_cpu_ops.txt)). After this it calls `context_tracking_user_exit` function which informs that Inform the context tracking that the processor is exiting userspace mode and entering the kernel:
|
||||
|
||||
```C
|
||||
static inline enum ctx_state exception_enter(void)
|
||||
{
|
||||
enum ctx_state prev_ctx;
|
||||
|
||||
if (!context_tracking_is_enabled())
|
||||
return 0;
|
||||
|
||||
prev_ctx = this_cpu_read(context_tracking.state);
|
||||
context_tracking_user_exit();
|
||||
|
||||
return prev_ctx;
|
||||
}
|
||||
```
|
||||
|
||||
The state can be one of the:
|
||||
|
||||
```C
|
||||
enum ctx_state {
|
||||
IN_KERNEL = 0,
|
||||
IN_USER,
|
||||
} state;
|
||||
```
|
||||
|
||||
And in the end we return previous context. Between the `exception_enter` and `exception_exit` we call actual page fault handler:
|
||||
|
||||
```C
|
||||
__do_page_fault(regs, error_code, address);
|
||||
```
|
||||
|
||||
The `__do_page_fault` is defined in the same source code file as `do_page_fault` - [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c). In the bingging of the `__do_page_fault` we check state of the [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt) checker. The `kmemcheck` detects warns about some uses of uninitialized memory. We need to check it because page fault can be caused by kmemcheck:
|
||||
|
||||
```C
|
||||
if (kmemcheck_active(regs))
|
||||
kmemcheck_hide(regs);
|
||||
prefetchw(&mm->mmap_sem);
|
||||
```
|
||||
|
||||
After this we can see the call of the `prefetchw` which executes instruction with the same [name](http://www.felixcloutier.com/x86/PREFETCHW.html) which fetches [X86_FEATURE_3DNOW](https://en.wikipedia.org/?title=3DNow!) to get exclusive [cache line](https://en.wikipedia.org/wiki/CPU_cache). The main purpose of prefetching is to hide the latency of a memory access. In the next step we check that we got page fault not in the kernel space with the following conditiion:
|
||||
|
||||
```C
|
||||
if (unlikely(fault_in_kernel_space(address))) {
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
where `fault_in_kernel_space` is:
|
||||
|
||||
```C
|
||||
static int fault_in_kernel_space(unsigned long address)
|
||||
{
|
||||
return address >= TASK_SIZE_MAX;
|
||||
}
|
||||
```
|
||||
|
||||
The `TASK_SIZE_MAX` macro expands to the:
|
||||
|
||||
```C
|
||||
#define TASK_SIZE_MAX ((1UL << 47) - PAGE_SIZE)
|
||||
```
|
||||
|
||||
or `0x00007ffffffff000`. Pay attention on `unlikely` macro. There are two macros in the Linux kernel:
|
||||
|
||||
```C
|
||||
#define likely(x) __builtin_expect(!!(x), 1)
|
||||
#define unlikely(x) __builtin_expect(!!(x), 0)
|
||||
```
|
||||
|
||||
You can [often](http://lxr.free-electrons.com/ident?i=unlikely) find these macros in the code of the Linux kernel. Main purpose of these macros is optimization. Sometimes this situation is that we need to check the condition of the code and we know that it will rarely be `true` or `false`. With these macros we can tell to the compiler about this. For example
|
||||
|
||||
```C
|
||||
static int proc_root_readdir(struct file *file, struct dir_context *ctx)
|
||||
{
|
||||
if (ctx->pos < FIRST_PROCESS_ENTRY) {
|
||||
int error = proc_readdir(file, ctx);
|
||||
if (unlikely(error <= 0))
|
||||
return error;
|
||||
...
|
||||
...
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
Here we can see `proc_root_readdir` function which will be called when the Linux [VFS](https://en.wikipedia.org/wiki/Virtual_file_system) needs to read the `root` directory contents. If condition marked with `unlikely`, compiler can put `false` code right after branching. Now let's back to the our address check. Comparison between the given address and the `0x00007ffffffff000` will give us to know, was page fault in the kernel mode or user mode. After this check we know it. After this `__do_page_fault` routine will try to understand the problem that provoked page fault exception and then will pass address to the approprite routine. It can be `kmemcheck` fault, spurious fault, [kprobes](https://www.kernel.org/doc/Documentation/kprobes.txt) fault and etc. Will not dive into implementation details of the page fault exception handler in this part, because we need to know many different concepts which are provided by the Linux kerne, but will see it in the chapter about the [memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html) in the Linux kernel.
|
||||
|
||||
Back to start_kernel
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
There are many different function calls after the `early_trap_pf_init` in the `setup_arch` function from different kernel subsystems, but there are no one interrupts and exceptions handling related. So, we have to go back where we came from - `start_kernel` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c#L492). The first things after the `setup_arch` is the `trap_init` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). This function makes initialization of the remaining exceptions handlers (remember that we already setup 3 handlres for the `#DB` - debug exception, `#BP` - breakpoint exception and `#PF` - page fault exception). The `trap_init` function starts from the check of the [Extended Industry Standard Architecture](https://en.wikipedia.org/wiki/Extended_Industry_Standard_Architecture):
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_EISA
|
||||
void __iomem *p = early_ioremap(0x0FFFD9, 4);
|
||||
|
||||
if (readl(p) == 'E' + ('I'<<8) + ('S'<<16) + ('A'<<24))
|
||||
EISA_bus = 1;
|
||||
early_iounmap(p, 4);
|
||||
#endif
|
||||
```
|
||||
|
||||
Note that it depends on the `CONFIG_EISA` kernel configuration parameter which represetns `EISA` support. Here we use `early_ioremap` function to map `I/O` memory on the page tables. We use `readl` function to read first `4` bytes from the mapped region and if they are equal to `EISA` string we set `EISA_bus` to one. In the end we just unmap previously mapped region. More about `early_ioremap` you can read in the part which describes [Fix-Mapped Addresses and ioremap](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html).
|
||||
|
||||
After this we start to fill the `Interrupt Descriptor Table` with the different interrupt gates. First of all we set `#DE` or `Divide Error` and `#NMI` or `Non-maskable Interrupt`:
|
||||
|
||||
```C
|
||||
set_intr_gate(X86_TRAP_DE, divide_error);
|
||||
set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);
|
||||
```
|
||||
|
||||
We use `set_intr_gate` macro to set the interrupt gate for the `#DE` exception and `set_intr_gate_ist` for the `#NMI`. You can remember that we already used these macros when we have set the interrupts gates for the page fault handler, debug handler and etc, you can find explanation of it in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html). After this we setup exception gates for the following exceptions:
|
||||
|
||||
```C
|
||||
set_system_intr_gate(X86_TRAP_OF, &overflow);
|
||||
set_intr_gate(X86_TRAP_BR, bounds);
|
||||
set_intr_gate(X86_TRAP_UD, invalid_op);
|
||||
set_intr_gate(X86_TRAP_NM, device_not_available);
|
||||
```
|
||||
|
||||
Here we can see:
|
||||
|
||||
* `#OF` or `Overflow` exception. This exception indicates that an overflow trap occurred when an special [INTO](http://x86.renejeschke.de/html/file_module_x86_id_142.html) instruction was executed;
|
||||
* `#BR` or `BOUND Range exceeded` exception. This exception indeicates that a `BOUND-range-exceed` fault occured when a [BOUND](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm) instruction was executed;
|
||||
* `#UD` or `Invalid Opcode` exception. Occurs when a processor attempted to execute invalid or reserved [opcode](https://en.wikipedia.org/?title=Opcode), processor attempted to execute instruction with invalid operand(s) and etc;
|
||||
* `#NM` or `Device Not Available` exception. Occurs when the processor tries to execute `x87 FPU` floating point instruction while `EM` flag in the [control register](https://en.wikipedia.org/wiki/Control_register#CR0) `cr0` was set.
|
||||
|
||||
In the next step we set the interrupt gate for the `#DF` or `Double fault` exception:
|
||||
|
||||
```C
|
||||
set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);
|
||||
```
|
||||
|
||||
This exception occurs when processor detected a second exception while calling an exception handler for a prior exception. In usual way when the processor detects another exception while trying to call an exception handler, the two exceptions can be handled serially. If the processor cannot handle them serially, it signals the double-fault or `#DF` exception.
|
||||
|
||||
The following set of the interrupt gates is:
|
||||
|
||||
```C
|
||||
set_intr_gate(X86_TRAP_OLD_MF, &coprocessor_segment_overrun);
|
||||
set_intr_gate(X86_TRAP_TS, &invalid_TSS);
|
||||
set_intr_gate(X86_TRAP_NP, &segment_not_present);
|
||||
set_intr_gate_ist(X86_TRAP_SS, &stack_segment, STACKFAULT_STACK);
|
||||
set_intr_gate(X86_TRAP_GP, &general_protection);
|
||||
set_intr_gate(X86_TRAP_SPURIOUS, &spurious_interrupt_bug);
|
||||
set_intr_gate(X86_TRAP_MF, &coprocessor_error);
|
||||
set_intr_gate(X86_TRAP_AC, &alignment_check);
|
||||
```
|
||||
|
||||
Here we can see setup for the following exception handlers:
|
||||
|
||||
* `#CSO` or `Coprocessor Segment Overrun` - this exception indicates that math [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) of an old processor detected a page or segment violation. Modern processors do not generate this exception
|
||||
* `#TS` or `Invalid TSS` exception - indicates that there was an error related to the [Task State Segment](https://en.wikipedia.org/wiki/Task_state_segment).
|
||||
* `#NP` or `Segement Not Present` exception indicates that the `present flag` of a segment or gate descriptor is clear during attempt to load one of `cs`, `ds`, `es`, `fs`, or `gs` register.
|
||||
* `#SS` or `Stack Fault` exception indicates one of the stack related conditions was detected, for example a not-present stack segment is detected when attempting to load the `ss` register.
|
||||
* `#GP` or `General Protection` exception indicates that the processor detected one of a class of protection violations called general-protection violations. There are many different conditions that can cause general-procetion exception. For example loading the `ss`, `ds`, `es`, `fs`, or `gs` register with a segment selector for a system segment, writing to a code segment or a read-only data segment, referencing an entry in the `Interrupt Descriptor Table` (following an interrupt or exception) that is not an interrupt, trap, or task gate and many many more.
|
||||
* `Spurious Interrupt` - a hardware interrupt that is unwanted.
|
||||
* `#MF` or `x87 FPU Floating-Point Error` exception caused when the [x87 FPU](https://en.wikipedia.org/wiki/X86_instruction_listings#x87_floating-point_instructions) has detected a floating point error.
|
||||
* `#AC` or `Alignment Check` exception Indicates that the processor detected an unaligned memory operand when alignment checking was enabled.
|
||||
|
||||
After that we setup this exception gates, we can see setup of the `Machine-Check` exception:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_X86_MCE
|
||||
set_intr_gate_ist(X86_TRAP_MC, &machine_check, MCE_STACK);
|
||||
#endif
|
||||
```
|
||||
|
||||
Note that it depends on the `CONFIG_X86_MCE` kernel configuration option and indicates that the processor detected an internal [machine error](https://en.wikipedia.org/wiki/Machine-check_exception) or a bus error, or that an external agent detected a bus error. The next exception gate is for the [SIMD](https://en.wikipedia.org/?title=SIMD) Floating-Point exception:
|
||||
|
||||
```C
|
||||
set_intr_gate(X86_TRAP_XF, &simd_coprocessor_error);
|
||||
```
|
||||
|
||||
which indicates the processor has detected an `SSE` or `SSE2` or `SSE3` SIMD floating-point exception. There are six classes of numeric exception conditions that can occur while executing an SIMD floating-point instruction:
|
||||
|
||||
* Invalid operation
|
||||
* Divide-by-zero
|
||||
* Denormal operand
|
||||
* Numeric overflow
|
||||
* Numeric underflow
|
||||
* Inexact result (Precision)
|
||||
|
||||
In the next step we fill the `used_vectors` array which defined in the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/desc.h) header file and represents `bitmap`:
|
||||
|
||||
```C
|
||||
DECLARE_BITMAP(used_vectors, NR_VECTORS);
|
||||
```
|
||||
|
||||
of the first `32` interrupts (more about bitmaps in the Linux kernel you can read in the part which describes [cpumasks and bitmaps](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html))
|
||||
|
||||
```C
|
||||
for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
|
||||
set_bit(i, used_vectors)
|
||||
```
|
||||
|
||||
where `FIRST_EXTERNAL_VECTOR` is:
|
||||
|
||||
```C
|
||||
#define FIRST_EXTERNAL_VECTOR 0x20
|
||||
```
|
||||
|
||||
After this we setup the interrupt gate for the `ia32_syscall` and add `0x80` to the `used_vectors` bitmap:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_IA32_EMULATION
|
||||
set_system_intr_gate(IA32_SYSCALL_VECTOR, ia32_syscall);
|
||||
set_bit(IA32_SYSCALL_VECTOR, used_vectors);
|
||||
#endif
|
||||
```
|
||||
|
||||
There is `CONFIG_IA32_EMULATION` kernel configuration option on `x86_64` Linux kernels. This option provides ability to execute 32-bit processes in compatibility-mode. In the next parts we will see how it works, in the meantime we need only to know that there is yet another interrupt gate in the `IDT` with the vector number `0x80`. In the next step we maps `IDT` to the fixmap area:
|
||||
|
||||
```C
|
||||
__set_fixmap(FIX_RO_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO);
|
||||
idt_descr.address = fix_to_virt(FIX_RO_IDT);
|
||||
```
|
||||
|
||||
and write its address to the `idt_descr.address` (more about fix-mapped addresses you can read in the second part of the [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) chapter). After this we can see the call of the `cpu_init` function that defined in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c). This function makes initialization of the all `per-cpu` state. In the beginnig of the `cpu_init` we do the following things: First of all we wait while current cpu is initialized and than we call the `cr4_init_shadow` function which stores shadow copy of the `cr4` control register for the current cpu and load CPU microcode if need with the following function calls:
|
||||
|
||||
```C
|
||||
wait_for_master_cpu(cpu);
|
||||
cr4_init_shadow();
|
||||
load_ucode_ap();
|
||||
```
|
||||
|
||||
Next we get the `Task State Segement` for the current cpu and `orig_ist` structure which represents origin `Interrupt Stack Table` values with the:
|
||||
|
||||
```C
|
||||
t = &per_cpu(cpu_tss, cpu);
|
||||
oist = &per_cpu(orig_ist, cpu);
|
||||
```
|
||||
|
||||
As we got values of the `Task State Segement` and `Interrupt Stack Table` for the current processor, we clear following bits in the `cr4` control register:
|
||||
|
||||
```C
|
||||
cr4_clear_bits(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE);
|
||||
```
|
||||
|
||||
with this we disable `vm86` extension, virtual interrupts, timestamp ([RDTSC](https://en.wikipedia.org/wiki/Time_Stamp_Counter) can only be executed with the highest privilege) and debug extension. After this we reload the `Glolbal Descripto Table` and `Interrupt Descriptor table` with the:
|
||||
|
||||
```C
|
||||
switch_to_new_gdt(cpu);
|
||||
loadsegment(fs, 0);
|
||||
load_current_idt();
|
||||
```
|
||||
|
||||
After this we setup array of the Thread-Local Storage Descriptors, configure [NX](https://en.wikipedia.org/wiki/NX_bit) and load CPU microcode. Now is time to setup and load `per-cpu` Task State Segements. We are going in a loop through the all exception stack which is `N_EXCEPTION_STACKS` or `4` and fill it with `Interrupt Stack Tables`:
|
||||
|
||||
```C
|
||||
if (!oist->ist[0]) {
|
||||
char *estacks = per_cpu(exception_stacks, cpu);
|
||||
|
||||
for (v = 0; v < N_EXCEPTION_STACKS; v++) {
|
||||
estacks += exception_stack_sizes[v];
|
||||
oist->ist[v] = t->x86_tss.ist[v] =
|
||||
(unsigned long)estacks;
|
||||
if (v == DEBUG_STACK-1)
|
||||
per_cpu(debug_stack_addr, cpu) = (unsigned long)estacks;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
As we have filled `Task State Segements` with the `Interrupt Stack Tables` we can set `TSS` descriptor for the current processor and load it with the:
|
||||
|
||||
```C
|
||||
set_tss_desc(cpu, t);
|
||||
load_TR_desc();
|
||||
```
|
||||
|
||||
where `set_tss_desc` macro from the [arch/x86/include/asm/desc.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/desc.h) writes given descriptor to the `Global Descriptor Table` of the given processor:
|
||||
|
||||
```C
|
||||
#define set_tss_desc(cpu, addr) __set_tss_desc(cpu, GDT_ENTRY_TSS, addr)
|
||||
static inline void __set_tss_desc(unsigned cpu, unsigned int entry, void *addr)
|
||||
{
|
||||
struct desc_struct *d = get_cpu_gdt_table(cpu);
|
||||
tss_desc tss;
|
||||
set_tssldt_descriptor(&tss, (unsigned long)addr, DESC_TSS,
|
||||
IO_BITMAP_OFFSET + IO_BITMAP_BYTES +
|
||||
sizeof(unsigned long) - 1);
|
||||
write_gdt_entry(d, entry, &tss, DESC_TSS);
|
||||
}
|
||||
```
|
||||
|
||||
and `load_TR_desc` macro expands to the `ltr` or `Load Task Register` instruction:
|
||||
|
||||
```C
|
||||
#define load_TR_desc() native_load_tr_desc()
|
||||
static inline void native_load_tr_desc(void)
|
||||
{
|
||||
asm volatile("ltr %w0"::"q" (GDT_ENTRY_TSS*8));
|
||||
}
|
||||
```
|
||||
|
||||
In the end of the `trap_init` function we can see the following code:
|
||||
|
||||
```C
|
||||
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
|
||||
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
|
||||
...
|
||||
...
|
||||
...
|
||||
#ifdef CONFIG_X86_64
|
||||
memcpy(&nmi_idt_table, &idt_table, IDT_ENTRIES * 16);
|
||||
set_nmi_gate(X86_TRAP_DB, &debug);
|
||||
set_nmi_gate(X86_TRAP_BP, &int3);
|
||||
#endif
|
||||
```
|
||||
|
||||
Here we copy `idt_table` to the `nmi_dit_table` and setup exception handlers for the `#DB` or `Debug exception` and `#BR` or `Breakpoint exception`. You can remember that we already set these interrupt gates in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html), so why do we need to setup it again? We setup it again because when we initialized it before in the `early_trap_init` function, the `Task State Segement` was not ready yet, but now it is ready after the call of the `cpu_init` function.
|
||||
|
||||
That's all. Soon we will consider all handlers of these interrupts/exceptions.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
It is the end of the fourth part about interrupts and interrupt handling in the Linux kernel. We saw the initialization of the [Task State Segment](https://en.wikipedia.org/wiki/Task_state_segment) in this part and initialization of the different interrupt handlers as `Divide Error`, `Page Fault` excetpion and etc. You can noted that we saw just initialization stuf, and will dive into details about handlers for these exceptions. In the next part we will start to do it.
|
||||
|
||||
If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
||||
|
||||
**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [page fault](https://en.wikipedia.org/wiki/Page_fault)
|
||||
* [Interrupt Descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table)
|
||||
* [Tracing](https://en.wikipedia.org/wiki/Tracing_%28software%29)
|
||||
* [cr2](https://en.wikipedia.org/wiki/Control_register)
|
||||
* [RCU](https://en.wikipedia.org/wiki/Read-copy-update)
|
||||
* [this_cpu_* operations](https://github.com/torvalds/linux/blob/master/Documentation/this_cpu_ops.txt)
|
||||
* [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt)
|
||||
* [prefetchw](http://www.felixcloutier.com/x86/PREFETCHW.html)
|
||||
* [3DNow](https://en.wikipedia.org/?title=3DNow!)
|
||||
* [CPU caches](https://en.wikipedia.org/wiki/CPU_cache)
|
||||
* [VFS](https://en.wikipedia.org/wiki/Virtual_file_system)
|
||||
* [Linux kernel memory management](http://0xax.gitbooks.io/linux-insides/content/mm/index.html)
|
||||
* [Fix-Mapped Addresses and ioremap](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)
|
||||
* [Extended Industry Standard Architecture](https://en.wikipedia.org/wiki/Extended_Industry_Standard_Architecture)
|
||||
* [INT isntruction](https://en.wikipedia.org/wiki/INT_%28x86_instruction%29)
|
||||
* [INTO](http://x86.renejeschke.de/html/file_module_x86_id_142.html)
|
||||
* [BOUND](http://pdos.csail.mit.edu/6.828/2005/readings/i386/BOUND.htm)
|
||||
* [opcode](https://en.wikipedia.org/?title=Opcode)
|
||||
* [control register](https://en.wikipedia.org/wiki/Control_register#CR0)
|
||||
* [x87 FPU](https://en.wikipedia.org/wiki/X86_instruction_listings#x87_floating-point_instructions)
|
||||
* [MCE exception](https://en.wikipedia.org/wiki/Machine-check_exception)
|
||||
* [SIMD](https://en.wikipedia.org/?title=SIMD)
|
||||
* [cpumasks and bitmaps](http://0xax.gitbooks.io/linux-insides/content/Concepts/cpumask.html)
|
||||
* [NX](https://en.wikipedia.org/wiki/NX_bit)
|
||||
* [Task State Segment](https://en.wikipedia.org/wiki/Task_state_segment)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html)
|
@ -0,0 +1,493 @@
|
||||
Interrupts and Interrupt Handling. Part 5.
|
||||
================================================================================
|
||||
|
||||
Implementation of exception handlers
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the fifth part about an interrupts and exceptions handling in the Linux kernel and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-4.html) we stopped on the setting of interrupt gates to the [Interrupt descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table). We did it in the `trap_init` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c) source code file. We saw only setting of these interrupt gates in the previous part and in the current part we will see implementation of the exception handlers for these gates. The preparation before an exception handler will be executed is in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and occurs in the [idtentry](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S#L820) macro that defines exceptions entry points:
|
||||
|
||||
```assembly
|
||||
idtentry divide_error do_divide_error has_error_code=0
|
||||
idtentry overflow do_overflow has_error_code=0
|
||||
idtentry invalid_op do_invalid_op has_error_code=0
|
||||
idtentry bounds do_bounds has_error_code=0
|
||||
idtentry device_not_available do_device_not_available has_error_code=0
|
||||
idtentry coprocessor_segment_overrun do_coprocessor_segment_overrun has_error_code=0
|
||||
idtentry invalid_TSS do_invalid_TSS has_error_code=1
|
||||
idtentry segment_not_present do_segment_not_present has_error_code=1
|
||||
idtentry spurious_interrupt_bug do_spurious_interrupt_bug has_error_code=0
|
||||
idtentry coprocessor_error do_coprocessor_error has_error_code=0
|
||||
idtentry alignment_check do_alignment_check has_error_code=1
|
||||
idtentry simd_coprocessor_error do_simd_coprocessor_error has_error_code=0
|
||||
```
|
||||
|
||||
The `idtentry` macro does following preparation before an actual exception handler (`do_divide_error` for the `divide_error`, `do_overflow` for the `overflow` and etc.) will get control. In another words the `idtentry` macro allocates place for the registers ([pt_regs](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/ptrace.h#L43) structure) on the stack, pushes dummy error code for the stack consistency if an interrupt/exception has no error code, checks the segment selector in the `cs` segment register and switches depends on the previous state(userspace or kernelspace). After all of these preparations it makes a call of an actual interrupt/exception handler:
|
||||
|
||||
```assembly
|
||||
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
|
||||
ENTRY(\sym)
|
||||
...
|
||||
...
|
||||
...
|
||||
call \do_sym
|
||||
...
|
||||
...
|
||||
...
|
||||
END(\sym)
|
||||
.endm
|
||||
```
|
||||
|
||||
After an exception handler will finish its work, the `idtentry` macro restores stack and general purpose registers of an interrupted task and executes [iret](http://x86.renejeschke.de/html/file_module_x86_id_145.html) instruction:
|
||||
|
||||
```assembly
|
||||
ENTRY(paranoid_exit)
|
||||
...
|
||||
...
|
||||
...
|
||||
RESTORE_EXTRA_REGS
|
||||
RESTORE_C_REGS
|
||||
REMOVE_PT_GPREGS_FROM_STACK 8
|
||||
INTERRUPT_RETURN
|
||||
END(paranoid_exit)
|
||||
```
|
||||
|
||||
where `INTERRUPT_RETURN` is:
|
||||
|
||||
```assembly
|
||||
#define INTERRUPT_RETURN jmp native_iret
|
||||
...
|
||||
ENTRY(native_iret)
|
||||
.global native_irq_return_iret
|
||||
native_irq_return_iret:
|
||||
iretq
|
||||
```
|
||||
|
||||
More about the `idtentry` macro you can read in the thirt part of the [http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-3.html) chapter. Ok, now we saw the preparation before an exception handler will be executed and now time to look on the handlers. First of all let's look on the following handlers:
|
||||
|
||||
* divide_error
|
||||
* overflow
|
||||
* invalid_op
|
||||
* coprocessor_segment_overrun
|
||||
* invalid_TSS
|
||||
* segment_not_present
|
||||
* stack_segment
|
||||
* alignment_check
|
||||
|
||||
All these handlers defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) source code file with the `DO_ERROR` macro:
|
||||
|
||||
```C
|
||||
DO_ERROR(X86_TRAP_DE, SIGFPE, "divide error", divide_error)
|
||||
DO_ERROR(X86_TRAP_OF, SIGSEGV, "overflow", overflow)
|
||||
DO_ERROR(X86_TRAP_UD, SIGILL, "invalid opcode", invalid_op)
|
||||
DO_ERROR(X86_TRAP_OLD_MF, SIGFPE, "coprocessor segment overrun", coprocessor_segment_overrun)
|
||||
DO_ERROR(X86_TRAP_TS, SIGSEGV, "invalid TSS", invalid_TSS)
|
||||
DO_ERROR(X86_TRAP_NP, SIGBUS, "segment not present", segment_not_present)
|
||||
DO_ERROR(X86_TRAP_SS, SIGBUS, "stack segment", stack_segment)
|
||||
DO_ERROR(X86_TRAP_AC, SIGBUS, "alignment check", alignment_check)
|
||||
```
|
||||
|
||||
As we can see the `DO_ERROR` macro takes 4 parameters:
|
||||
|
||||
* Vector number of an interrupt;
|
||||
* Signal number which will be sent to the interrupted process;
|
||||
* String which describes an exception;
|
||||
* Exception handler entry point.
|
||||
|
||||
This macro defined in the same souce code file and expands to the function with the `do_handler` name:
|
||||
|
||||
```C
|
||||
#define DO_ERROR(trapnr, signr, str, name) \
|
||||
dotraplinkage void do_##name(struct pt_regs *regs, long error_code) \
|
||||
{ \
|
||||
do_error_trap(regs, error_code, str, trapnr, signr); \
|
||||
}
|
||||
```
|
||||
|
||||
Note on the `##` tokens. This is special feature - [GCC macro Concatenation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html#Concatenation) which concatenates two given strings. For example, first `DO_ERROR` in our example will expands to the:
|
||||
|
||||
```C
|
||||
dotraplinkage void do_divide_error(struct pt_regs *regs, long error_code) \
|
||||
{
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
We can see that all functions which are generated by the `DO_ERROR` macro just make a call of the `do_error_trap` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). Let's look on implementation of the `do_error_trap` function.
|
||||
|
||||
Trap handlers
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The `do_error_trap` function starts and ends from the two following functions:
|
||||
|
||||
```C
|
||||
enum ctx_state prev_state = exception_enter();
|
||||
...
|
||||
...
|
||||
...
|
||||
exception_exit(prev_state);
|
||||
```
|
||||
|
||||
from the [include/linux/context_tracking.h](https://github.com/torvalds/linux/tree/master/include/linux/context_tracking.h). The context tracking in the Linux kernel subsystem which provide kernel boundaries probes to keep track of the transitions between level contexts with two basic initial contexts: `user` or `kernel`. The `exception_enter` function checks that context tracking is enabled. After this if it is enabled, the `exception_enter` reads previous context and compares it with the `CONTEXT_KERNEL`. If the previous context is `user`, we call `context_tracking_exit` function from the [kernel/context_tracking.c](https://github.com/torvalds/linux/blob/master/kernel/context_tracking.c) which inform the context tracking subsystem that a processor is exiting user mode and entering the kernel mode:
|
||||
|
||||
```C
|
||||
if (!context_tracking_is_enabled())
|
||||
return 0;
|
||||
|
||||
prev_ctx = this_cpu_read(context_tracking.state);
|
||||
if (prev_ctx != CONTEXT_KERNEL)
|
||||
context_tracking_exit(prev_ctx);
|
||||
|
||||
return prev_ctx;
|
||||
```
|
||||
|
||||
If previous context is non `user`, we just return it. The `pre_ctx` has `enum ctx_state` type which defined in the [include/linux/context_tracking_state.h](https://github.com/torvalds/linux/tree/master/include/linux/context_tracking_state.h) and looks as:
|
||||
|
||||
```C
|
||||
enum ctx_state {
|
||||
CONTEXT_KERNEL = 0,
|
||||
CONTEXT_USER,
|
||||
CONTEXT_GUEST,
|
||||
} state;
|
||||
```
|
||||
|
||||
The second function is `exception_exit` defined in the same [include/linux/context_tracking.h](https://github.com/torvalds/linux/tree/master/include/linux/context_tracking.h) file and checks that context tracking is enabled and call the `contert_tracking_enter` function if the previous context was `user`:
|
||||
|
||||
```C
|
||||
static inline void exception_exit(enum ctx_state prev_ctx)
|
||||
{
|
||||
if (context_tracking_is_enabled()) {
|
||||
if (prev_ctx != CONTEXT_KERNEL)
|
||||
context_tracking_enter(prev_ctx);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
The `context_tracking_enter` function informs the context tracking subsystem that a processor is going to enter to the user mode from the kernel mode. We can see the following code between the `exception_enter` and `exception_exit`:
|
||||
|
||||
```C
|
||||
if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) !=
|
||||
NOTIFY_STOP) {
|
||||
conditional_sti(regs);
|
||||
do_trap(trapnr, signr, str, regs, error_code,
|
||||
fill_trap_info(regs, signr, trapnr, &info));
|
||||
}
|
||||
```
|
||||
|
||||
First of all it calls the `notify_die` function which defined in the [kernel/notifier.c](https://github.com/torvalds/linux/tree/master/kernel/notifier.c). To get notified for [kernel panic](https://en.wikipedia.org/wiki/Kernel_panic), [kernel oops](https://en.wikipedia.org/wiki/Linux_kernel_oops), [Non-Maskable Interrupt](https://en.wikipedia.org/wiki/Non-maskable_interrupt) or other events the caller needs to insert itself in the `notify_die` chain and the `notify_die` function does it. The Linux kernel has special mechanism that allows kernel to ask when something happens and this mechanism called `notifiers` or `notifier chains`. This mechanism used for example for the `USB` hotplug events (look on the [drivers/usb/core/notify.c](https://github.com/torvalds/linux/tree/master/drivers/usb/core/notify.c)), for the memory [hotplug](https://en.wikipedia.org/wiki/Hot_swapping) (look on the [include/linux/memory.h](https://github.com/torvalds/linux/tree/master/include/linux/memory.h), the `hotplug_memory_notifier` macro and etc...), system reboots and etc. A notifier chain is thus a simple, singly-linked list. When a Linux kernel subsystem wants to be notified of specific events, it fills out a special `notifier_block` structure and passes it to the `notifier_chain_register` function. An event can be sent with the call of the `notifier_call_chain` function. First of all the `notify_die` function fills `die_args` structure with the trap number, trap string, registers and other values:
|
||||
|
||||
```C
|
||||
struct die_args args = {
|
||||
.regs = regs,
|
||||
.str = str,
|
||||
.err = err,
|
||||
.trapnr = trap,
|
||||
.signr = sig,
|
||||
}
|
||||
```
|
||||
|
||||
and returns the result of the `atomic_notifier_call_chain` function with the `die_chain`:
|
||||
|
||||
```C
|
||||
static ATOMIC_NOTIFIER_HEAD(die_chain);
|
||||
return atomic_notifier_call_chain(&die_chain, val, &args);
|
||||
```
|
||||
|
||||
which just expands to the `atomit_notifier_head` structure that contains lock and `notifier_block`:
|
||||
|
||||
```C
|
||||
struct atomic_notifier_head {
|
||||
spinlock_t lock;
|
||||
struct notifier_block __rcu *head;
|
||||
};
|
||||
```
|
||||
|
||||
The `atomic_notifier_call_chain` function calls each function in a notifier chain in turn and returns the value of the last notifier function called. If the `notify_die` in the `do_error_trap` does not return `NOTIFY_STOP` we execute `conditional_sti` function from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/traps.c) that checks the value of the [interrupt flag](https://en.wikipedia.org/wiki/Interrupt_flag) and enables interrupt depends on it:
|
||||
|
||||
```C
|
||||
static inline void conditional_sti(struct pt_regs *regs)
|
||||
{
|
||||
if (regs->flags & X86_EFLAGS_IF)
|
||||
local_irq_enable();
|
||||
}
|
||||
```
|
||||
|
||||
more about `local_irq_enable` macro you can read in the second [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-2.html) of this chapter. The next and last call in the `do_error_trap` is the `do_trap` function. First of all the `do_trap` function defined the `tsk` variable which has `trak_struct` type and represents the current interrupted process. After the definition of the `tsk`, we can see the call of the `do_trap_no_signal` function:
|
||||
|
||||
```C
|
||||
struct task_struct *tsk = current;
|
||||
|
||||
if (!do_trap_no_signal(tsk, trapnr, str, regs, error_code))
|
||||
return;
|
||||
```
|
||||
|
||||
The `do_trap_no_signal` function makes two checks:
|
||||
|
||||
* Did we come from the [Virtual 8086](https://en.wikipedia.org/wiki/Virtual_8086_mode) mode;
|
||||
* Did we come from the kernelspace.
|
||||
|
||||
```C
|
||||
if (v8086_mode(regs)) {
|
||||
...
|
||||
}
|
||||
|
||||
if (!user_mode(regs)) {
|
||||
...
|
||||
}
|
||||
|
||||
return -1;
|
||||
```
|
||||
|
||||
We will not consider first case because the [long mode](https://en.wikipedia.org/wiki/Long_mode) does not support the [Virtual 8086](https://en.wikipedia.org/wiki/Virtual_8086_mode) mode. In the second case we invoke `fixup_exception` function which will try to recover a fault and `die` if we can't:
|
||||
|
||||
```C
|
||||
if (!fixup_exception(regs)) {
|
||||
tsk->thread.error_code = error_code;
|
||||
tsk->thread.trap_nr = trapnr;
|
||||
die(str, regs, error_code);
|
||||
}
|
||||
```
|
||||
|
||||
The `die` function defined in the [arch/x86/kernel/dumpstack.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/dumpstack.c) source code file, prints useful information about stack, registers, kernel modules and caused kernel [oops](https://en.wikipedia.org/wiki/Linux_kernel_oops). If we came from the userspace the `do_trap_no_signal` function will return `-1` and the execution of the `do_trap` function will continue. If we passed through the `do_trap_no_signal` function and did not exit from the `do_trap` after this, it means that previous context was - `user`. Most exceptions caused by the processor are interpreted by Linux as error conditions, for example division by zero, invalid opcode and etc. When an exception occurs the Linux kernel sends a [signal](https://en.wikipedia.org/wiki/Unix_signal) to the interrupted process that caused the exception to notify it of an incorrect condition. So, in the `do_trap` function we need to send a signal with the given number (`SIGFPE` for the divide error, `SIGILL` for the overflow exception and etc...). First of all we save error code and vector number in the current interrupts process with the filling `thread.error_code` and `thread_trap_nr`:
|
||||
|
||||
```C
|
||||
tsk->thread.error_code = error_code;
|
||||
tsk->thread.trap_nr = trapnr;
|
||||
```
|
||||
|
||||
After this we make a check do we need to print information about unhandled signals for the interrupted process. We check that `show_unhandled_signals` variable is set, that `unhandled_signal` function from the [kernel/signal.c](https://github.com/torvalds/linux/blob/master/kernel/signal.c) will return unhandled signal(s) and [printk](https://en.wikipedia.org/wiki/Printk) rate limit:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_X86_64
|
||||
if (show_unhandled_signals && unhandled_signal(tsk, signr) &&
|
||||
printk_ratelimit()) {
|
||||
pr_info("%s[%d] trap %s ip:%lx sp:%lx error:%lx",
|
||||
tsk->comm, tsk->pid, str,
|
||||
regs->ip, regs->sp, error_code);
|
||||
print_vma_addr(" in ", regs->ip);
|
||||
pr_cont("\n");
|
||||
}
|
||||
#endif
|
||||
```
|
||||
|
||||
And send a given signal to interrupted process:
|
||||
|
||||
```C
|
||||
force_sig_info(signr, info ?: SEND_SIG_PRIV, tsk);
|
||||
```
|
||||
|
||||
This is the end of the `do_trap`. We just saw generic implementation for eight different exceptions which are defined with the `DO_ERROR` macro. Now let's look on another exception handlers.
|
||||
|
||||
Double fault
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The next exception is `#DF` or `Double fault`. This exception occurrs when the processor detected a second exception while calling an exception handler for a prior exception. We set the trap gate for this exception in the previous part:
|
||||
|
||||
```C
|
||||
set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);
|
||||
```
|
||||
|
||||
Note that this exception runs on the `DOUBLEFAULT_STACK` [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks) which has index - `1`:
|
||||
|
||||
```C
|
||||
#define DOUBLEFAULT_STACK 1
|
||||
```
|
||||
|
||||
The `double_fault` is handler for this exception and defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). The `double_fault` handler starts from the definition of two variables: string that describes excetpion and interrupted process, as other exception handlers:
|
||||
|
||||
```C
|
||||
static const char str[] = "double fault";
|
||||
struct task_struct *tsk = current;
|
||||
```
|
||||
|
||||
The handler of the double fault exception splitted on two parts. The first part is the check which checks that a fault is a `non-IST` fault on the `espfix64` stack. Actually the `iret` instruction restores only the bottom `16` bits when returning to a `16` bit segment. The `espfix` feature solves this problem. So if the `non-IST` fault on the espfix64 stack we modify the stack to make it look like `General Protection Fault`:
|
||||
|
||||
```C
|
||||
struct pt_regs *normal_regs = task_pt_regs(current);
|
||||
|
||||
memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
|
||||
ormal_regs->orig_ax = 0;
|
||||
regs->ip = (unsigned long)general_protection;
|
||||
regs->sp = (unsigned long)&normal_regs->orig_ax;
|
||||
return;
|
||||
```
|
||||
|
||||
In the second case we do almost the same that we did in the previous excetpion handlers. The first is the call of the `ist_enter` function that discards previous context, `user` in our case:
|
||||
|
||||
```C
|
||||
ist_enter(regs);
|
||||
```
|
||||
|
||||
And after this we fill the interrupted process with the vector number of the `Double fault` excetpion and error code as we did it in the previous handlers:
|
||||
|
||||
```C
|
||||
tsk->thread.error_code = error_code;
|
||||
tsk->thread.trap_nr = X86_TRAP_DF;
|
||||
```
|
||||
|
||||
Next we print useful information about the double fault ([PID](https://en.wikipedia.org/wiki/Process_identifier) number, registers content):
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_DOUBLEFAULT
|
||||
df_debug(regs, error_code);
|
||||
#endif
|
||||
```
|
||||
|
||||
And die:
|
||||
|
||||
```
|
||||
for (;;)
|
||||
die(str, regs, error_code);
|
||||
```
|
||||
|
||||
That's all.
|
||||
|
||||
Device not available exception handler
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The next exception is the `#NM` or `Device not available`. The `Device not available` exception can occur depending on these things:
|
||||
|
||||
* The processor executed an [x87 FPU](https://en.wikipedia.org/wiki/X87) floating-point instruction while the EM flag in [control register](https://en.wikipedia.org/wiki/Control_register) `cr0` was set;
|
||||
* The processor executed a `wait` or `fwait` instruction while the `MP` and `TS` flags of register `cr0` were set;
|
||||
* The processor executed an [x87 FPU](https://en.wikipedia.org/wiki/X87), [MMX](https://en.wikipedia.org/wiki/MMX_%28instruction_set%29) or [SSE](https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions) instruction while the `TS` falg in control register `cr0` was set and the `EM` flag is clear.
|
||||
|
||||
The handler of the `Device not available` exception is the `do_device_not_available` function and it defined in the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c) source code file too. It starts and ends from the getting of the previous context, as other traps which we saw in the beginning of this part:
|
||||
|
||||
```C
|
||||
enum ctx_state prev_state;
|
||||
prev_state = exception_enter();
|
||||
...
|
||||
...
|
||||
...
|
||||
exception_exit(prev_state);
|
||||
```
|
||||
|
||||
In the next step we check that `FPU` is not eager:
|
||||
|
||||
```C
|
||||
BUG_ON(use_eager_fpu());
|
||||
```
|
||||
|
||||
When we switch into a task or interrupt we may avoid loading the `FPU` state. If a task will use it, we catch `Device not Available exception` exception. If we loading the `FPU` state during task switching, the `FPU` is eager. In the next step we check `cr0` control register on the `EM` flag which can show us is `x87` floating point unit present (flag clear) or not (flag set):
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_MATH_EMULATION
|
||||
if (read_cr0() & X86_CR0_EM) {
|
||||
struct math_emu_info info = { };
|
||||
|
||||
conditional_sti(regs);
|
||||
|
||||
info.regs = regs;
|
||||
math_emulate(&info);
|
||||
exception_exit(prev_state);
|
||||
return;
|
||||
}
|
||||
#endif
|
||||
```
|
||||
|
||||
If the `x87` floating point unit not presented, we enable interrupts with the `conditional_sti`, fill the `math_emu_info` (defined in the [arch/x86/include/asm/math_emu.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/math_emu.h)) structure with the registers of an interrupt task and call `math_emulate` function from the [arch/x86/math-emu/fpu_entry.c](https://github.com/torvalds/linux/tree/master/arch/x86/math-emu/fpu_entry.c). As you can understand from function's name, it emulates `X87 FPU` unit (more about the `x87` we will know in the special chapter). In other way, if `X86_CR0_EM` flag is clear which means that `x87 FPU` unit is presented, we call the `fpu__restore` function from the [arch/x86/kernel/fpu/core.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/fpu/core.c) which copies the `FPU` registers from the `fpustate` to the live hardware registers. After this `FPU` instructions can be used:
|
||||
|
||||
```C
|
||||
fpu__restore(¤t->thread.fpu);
|
||||
```
|
||||
|
||||
General protection fault exception handler
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The next exception is the `#GP` or `General protection fault`. This exception occurs when the processor detected one of a class of protection violations called `general-protection violations`. It can be:
|
||||
|
||||
* Exceeding the segment limit when accessing the `cs`, `ds`, `es`, `fs` or `gs` segments;
|
||||
* Loading the `ss`, `ds`, `es`, `fs` or `gs` register with a segment selector for a system segment.;
|
||||
* Violating any of the privilege rules;
|
||||
* and other...
|
||||
|
||||
The exception handler for this exception is the `do_general_protection` from the [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). The `do_general_protection` function starts and ends as other exception handlers from the getting of the previous context:
|
||||
|
||||
```C
|
||||
prev_state = exception_enter();
|
||||
...
|
||||
exception_exit(prev_state);
|
||||
```
|
||||
|
||||
After this we enable interrupts if they were disabled and check that we came from the [Virtual 8086](https://en.wikipedia.org/wiki/Virtual_8086_mode) mode:
|
||||
|
||||
```C
|
||||
conditional_sti(regs);
|
||||
|
||||
if (v8086_mode(regs)) {
|
||||
local_irq_enable();
|
||||
handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
|
||||
goto exit;
|
||||
}
|
||||
```
|
||||
|
||||
As long mode does not support this mode, we will not consider exception handling for this case. In the next step check that previous mode was kernel mode and try to fix the trap. If we can't fix the current general protection fault exception we fill the interrupted process with the vector number and error code of the exception and add it to the `notify_die` chain:
|
||||
|
||||
```C
|
||||
if (!user_mode(regs)) {
|
||||
if (fixup_exception(regs))
|
||||
goto exit;
|
||||
|
||||
tsk->thread.error_code = error_code;
|
||||
tsk->thread.trap_nr = X86_TRAP_GP;
|
||||
if (notify_die(DIE_GPF, "general protection fault", regs, error_code,
|
||||
X86_TRAP_GP, SIGSEGV) != NOTIFY_STOP)
|
||||
die("general protection fault", regs, error_code);
|
||||
goto exit;
|
||||
}
|
||||
```
|
||||
|
||||
If we can fix exception we go to the `exit` label which exits from exception state:
|
||||
|
||||
```C
|
||||
exit:
|
||||
exception_exit(prev_state);
|
||||
```
|
||||
|
||||
If we came from user mode we send `SIGSEGV` signal to the interrupted process from user mode as we did it in the `do_trap` function:
|
||||
|
||||
```C
|
||||
if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
|
||||
printk_ratelimit()) {
|
||||
pr_info("%s[%d] general protection ip:%lx sp:%lx error:%lx",
|
||||
tsk->comm, task_pid_nr(tsk),
|
||||
regs->ip, regs->sp, error_code);
|
||||
print_vma_addr(" in ", regs->ip);
|
||||
pr_cont("\n");
|
||||
}
|
||||
|
||||
force_sig_info(SIGSEGV, SEND_SIG_PRIV, tsk);
|
||||
```
|
||||
|
||||
That's all.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
It is the end of the fifth part of the [Interrupts and Interrupt Handling](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) chapter and we saw implementation of some interrupt handlers in this part. In the next part we will continue to dive into interrupt and exception handlers and will see handler for the [Non-Maskable Interrupts](https://en.wikipedia.org/wiki/Non-maskable_interrupt), handling of the math [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) and [SIMD](https://en.wikipedia.org/wiki/SIMD) coprocessor exceptions and many many more.
|
||||
|
||||
If you will have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
||||
|
||||
**Please note that English is not my first language, And I am really sorry for any inconvenience. If you will find any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [Interrupt descriptor Table](https://en.wikipedia.org/wiki/Interrupt_descriptor_table)
|
||||
* [iret instruction](http://x86.renejeschke.de/html/file_module_x86_id_145.html)
|
||||
* [GCC macro Concatenation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html#Concatenation)
|
||||
* [kernel panic](https://en.wikipedia.org/wiki/Kernel_panic)
|
||||
* [kernel oops](https://en.wikipedia.org/wiki/Linux_kernel_oops)
|
||||
* [Non-Maskable Interrupt](https://en.wikipedia.org/wiki/Non-maskable_interrupt)
|
||||
* [hotplug](https://en.wikipedia.org/wiki/Hot_swapping)
|
||||
* [interrupt flag](https://en.wikipedia.org/wiki/Interrupt_flag)
|
||||
* [long mode](https://en.wikipedia.org/wiki/Long_mode)
|
||||
* [signal](https://en.wikipedia.org/wiki/Unix_signal)
|
||||
* [printk](https://en.wikipedia.org/wiki/Printk)
|
||||
* [coprocessor](https://en.wikipedia.org/wiki/Coprocessor)
|
||||
* [SIMD](https://en.wikipedia.org/wiki/SIMD)
|
||||
* [Interrupt Stack Table](https://www.kernel.org/doc/Documentation/x86/x86_64/kernel-stacks)
|
||||
* [PID](https://en.wikipedia.org/wiki/Process_identifier)
|
||||
* [x87 FPU](https://en.wikipedia.org/wiki/X87)
|
||||
* [control register](https://en.wikipedia.org/wiki/Control_register)
|
||||
* [MMX](https://en.wikipedia.org/wiki/MMX_%28instruction_set%29)
|
||||
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-4.html)
|
Loading…
Reference in new issue