mirror of
https://github.com/0xAX/linux-insides.git
synced 2024-11-18 14:08:09 +00:00
474 lines
30 KiB
Markdown
474 lines
30 KiB
Markdown
Kernel initialization. Part 10.
|
|
================================================================================
|
|
|
|
End of the linux kernel initialization process
|
|
================================================================================
|
|
|
|
This is tenth part of the chapter about linux kernel [initialization process](https://proninyaroslav.gitbooks.io/linux-insides-ru/content/Initialization/index.html) and in the [previous part](https://proninyaroslav.gitbooks.io/linux-insides-ru/content/Initialization/linux-initialization-9.html) we saw the initialization of the [RCU](http://en.wikipedia.org/wiki/Read-copy-update) and stopped on the call of the `acpi_early_init` function. This part will be the last part of the [Kernel initialization process](https://proninyaroslav.gitbooks.io/linux-insides-ru/content/Initialization/index.html) chapter, so let's finish it.
|
|
|
|
After the call of the `acpi_early_init` function from the [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c), we can see the following code:
|
|
|
|
```C
|
|
#ifdef CONFIG_X86_ESPFIX64
|
|
init_espfix_bsp();
|
|
#endif
|
|
```
|
|
|
|
Here we can see the call of the `init_espfix_bsp` function which depends on the `CONFIG_X86_ESPFIX64` kernel configuration option. As we can understand from the function name, it does something with the stack. This function is defined in the [arch/x86/kernel/espfix_64.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/espfix_64.c) and prevents leaking of `31:16` bits of the `esp` register during returning to 16-bit stack. First of all we install `espfix` page upper directory into the kernel page directory in the `init_espfix_bs`:
|
|
|
|
```C
|
|
pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
|
|
pgd_populate(&init_mm, pgd_p, (pud_t *)espfix_pud_page);
|
|
```
|
|
|
|
Where `ESPFIX_BASE_ADDR` is:
|
|
|
|
```C
|
|
#define PGDIR_SHIFT 39
|
|
#define ESPFIX_PGD_ENTRY _AC(-2, UL)
|
|
#define ESPFIX_BASE_ADDR (ESPFIX_PGD_ENTRY << PGDIR_SHIFT)
|
|
```
|
|
|
|
Also we can find it in the [Documentation/x86/x86_64/mm](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/x86_64/mm.txt):
|
|
|
|
```
|
|
... unused hole ...
|
|
ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
|
|
... unused hole ...
|
|
```
|
|
|
|
After we've filled page global directory with the `espfix` pud, the next step is call of the `init_espfix_random` and `init_espfix_ap` functions. The first function returns random locations for the `espfix` page and the second enables the `espfix` for the current CPU. After the `init_espfix_bsp` finished the work, we can see the call of the `thread_info_cache_init` function which defined in the [kernel/fork.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/fork.c) and allocates cache for the `thread_info` if `THREAD_SIZE` is less than `PAGE_SIZE`:
|
|
|
|
```C
|
|
# if THREAD_SIZE >= PAGE_SIZE
|
|
...
|
|
...
|
|
...
|
|
void thread_info_cache_init(void)
|
|
{
|
|
thread_info_cache = kmem_cache_create("thread_info", THREAD_SIZE,
|
|
THREAD_SIZE, 0, NULL);
|
|
BUG_ON(thread_info_cache == NULL);
|
|
}
|
|
...
|
|
...
|
|
...
|
|
#endif
|
|
```
|
|
|
|
As we already know the `PAGE_SIZE` is `(_AC(1,UL) << PAGE_SHIFT)` or `4096` bytes and `THREAD_SIZE` is `(PAGE_SIZE << THREAD_SIZE_ORDER)` or `16384` bytes for the `x86_64`. The next function after the `thread_info_cache_init` is the `cred_init` from the [kernel/cred.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/cred.c). This function just allocates cache for the credentials (like `uid`, `gid`, etc.):
|
|
|
|
```C
|
|
void __init cred_init(void)
|
|
{
|
|
cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred),
|
|
0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
|
|
}
|
|
```
|
|
|
|
more about credentials you can read in the [Documentation/security/credentials.txt](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/security/credentials.txt). Next step is the `fork_init` function from the [kernel/fork.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/fork.c). The `fork_init` function allocates cache for the `task_struct`. Let's look on the implementation of the `fork_init`. First of all we can see definitions of the `ARCH_MIN_TASKALIGN` macro and creation of a slab where task_structs will be allocated:
|
|
|
|
```C
|
|
#ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
|
|
#ifndef ARCH_MIN_TASKALIGN
|
|
#define ARCH_MIN_TASKALIGN L1_CACHE_BYTES
|
|
#endif
|
|
task_struct_cachep =
|
|
kmem_cache_create("task_struct", sizeof(struct task_struct),
|
|
ARCH_MIN_TASKALIGN, SLAB_PANIC | SLAB_NOTRACK, NULL);
|
|
#endif
|
|
```
|
|
|
|
As we can see this code depends on the `CONFIG_ARCH_TASK_STRUCT_ACLLOCATOR` kernel configuration option. This configuration option shows the presence of the `alloc_task_struct` for the given architecture. As `x86_64` has no `alloc_task_struct` function, this code will not work and even will not be compiled on the `x86_64`.
|
|
|
|
Allocating cache for init task
|
|
--------------------------------------------------------------------------------
|
|
|
|
After this we can see the call of the `arch_task_cache_init` function in the `fork_init`:
|
|
|
|
```C
|
|
void arch_task_cache_init(void)
|
|
{
|
|
task_xstate_cachep =
|
|
kmem_cache_create("task_xstate", xstate_size,
|
|
__alignof__(union thread_xstate),
|
|
SLAB_PANIC | SLAB_NOTRACK, NULL);
|
|
setup_xstate_comp();
|
|
}
|
|
```
|
|
|
|
The `arch_task_cache_init` does initialization of the architecture-specific caches. In our case it is `x86_64`, so as we can see, the `arch_task_cache_init` allocates cache for the `task_xstate` which represents [FPU](http://en.wikipedia.org/wiki/Floating-point_unit) state and sets up offsets and sizes of all extended states in [xsave](http://www.felixcloutier.com/x86/XSAVES.html) area with the call of the `setup_xstate_comp` function. After the `arch_task_cache_init` we calculate default maximum number of threads with the:
|
|
|
|
```C
|
|
set_max_threads(MAX_THREADS);
|
|
```
|
|
|
|
where default maximum number of threads is:
|
|
|
|
```C
|
|
#define FUTEX_TID_MASK 0x3fffffff
|
|
#define MAX_THREADS FUTEX_TID_MASK
|
|
```
|
|
|
|
In the end of the `fork_init` function we initialize [signal](http://www.win.tue.nl/~aeb/linux/lk/lk-5.html) handler:
|
|
|
|
```C
|
|
init_task.signal->rlim[RLIMIT_NPROC].rlim_cur = max_threads/2;
|
|
init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
|
|
init_task.signal->rlim[RLIMIT_SIGPENDING] =
|
|
init_task.signal->rlim[RLIMIT_NPROC];
|
|
```
|
|
|
|
As we know the `init_task` is an instance of the `task_struct` structure, so it contains `signal` field which represents signal handler. It has following type `struct signal_struct`. On the first two lines we can see setting of the current and maximum limit of the `resource limits`. Every process has an associated set of resource limits. These limits specify amount of resources which current process can use. Here `rlim` is resource control limit and presented by the:
|
|
|
|
```C
|
|
struct rlimit {
|
|
__kernel_ulong_t rlim_cur;
|
|
__kernel_ulong_t rlim_max;
|
|
};
|
|
```
|
|
|
|
structure from the [include/uapi/linux/resource.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/uapi/linux/resource.h). In our case the resource is the `RLIMIT_NPROC` which is the maximum number of processes that user can own and `RLIMIT_SIGPENDING` - the maximum number of pending signals. We can see it in the:
|
|
|
|
```C
|
|
cat /proc/self/limits
|
|
Limit Soft Limit Hard Limit Units
|
|
...
|
|
...
|
|
...
|
|
Max processes 63815 63815 processes
|
|
Max pending signals 63815 63815 signals
|
|
...
|
|
...
|
|
...
|
|
```
|
|
|
|
Initialization of the caches
|
|
--------------------------------------------------------------------------------
|
|
|
|
The next function after the `fork_init` is the `proc_caches_init` from the [kernel/fork.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/fork.c). This function allocates caches for the memory descriptors (or `mm_struct` structure). At the beginning of the `proc_caches_init` we can see allocation of the different [SLAB](http://en.wikipedia.org/wiki/Slab_allocation) caches with the call of the `kmem_cache_create`:
|
|
|
|
* `sighand_cachep` - manage information about installed signal handlers;
|
|
* `signal_cachep` - manage information about process signal descriptor;
|
|
* `files_cachep` - manage information about opened files;
|
|
* `fs_cachep` - manage filesystem information.
|
|
|
|
After this we allocate `SLAB` cache for the `mm_struct` structures:
|
|
|
|
```C
|
|
mm_cachep = kmem_cache_create("mm_struct",
|
|
sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
|
|
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL);
|
|
```
|
|
|
|
|
|
After this we allocate `SLAB` cache for the important `vm_area_struct` which used by the kernel to manage virtual memory space:
|
|
|
|
```C
|
|
vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC);
|
|
```
|
|
|
|
Note, that we use `KMEM_CACHE` macro here instead of the `kmem_cache_create`. This macro is defined in the [include/linux/slab.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/slab.h) and just expands to the `kmem_cache_create` call:
|
|
|
|
```C
|
|
#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
|
|
sizeof(struct __struct), __alignof__(struct __struct),\
|
|
(__flags), NULL)
|
|
```
|
|
|
|
The `KMEM_CACHE` has one difference from `kmem_cache_create`. Take a look on `__alignof__` operator. The `KMEM_CACHE` macro aligns `SLAB` to the size of the given structure, but `kmem_cache_create` uses given value to align space. After this we can see the call of the `mmap_init` and `nsproxy_cache_init` functions. The first function initializes virtual memory area `SLAB` and the second function initializes `SLAB` for namespaces.
|
|
|
|
The next function after the `proc_caches_init` is `buffer_init`. This function is defined in the [fs/buffer.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/fs/buffer.c) source code file and allocate cache for the `buffer_head`. The `buffer_head` is a special structure which defined in the [include/linux/buffer_head.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/buffer_head.h) and used for managing buffers. In the start of the `buffer_init` function we allocate cache for the `struct buffer_head` structures with the call of the `kmem_cache_create` function as we did in the previous functions. And calculate the maximum size of the buffers in memory with:
|
|
|
|
```C
|
|
nrpages = (nr_free_buffer_pages() * 10) / 100;
|
|
max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
|
|
```
|
|
|
|
which will be equal to the `10%` of the `ZONE_NORMAL` (all RAM from the 4GB on the `x86_64`). The next function after the `buffer_init` is - `vfs_caches_init`. This function allocates `SLAB` caches and hashtable for different [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) caches. We already saw the `vfs_caches_init_early` function in the eighth part of the linux kernel [initialization process](https://proninyaroslav.gitbooks.io/linux-insides-ru/content/Initialization/linux-initialization-8.html) which initialized caches for `dcache` (or directory-cache) and [inode](http://en.wikipedia.org/wiki/Inode) cache. The `vfs_caches_init` function makes post-early initialization of the `dcache` and `inode` caches, private data cache, hash tables for the mount points, etc. More details about [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) will be described in the separate part. After this we can see `signals_init` function. This function is defined in the [kernel/signal.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/signal.c) and allocates a cache for the `sigqueue` structures which represents queue of the real time signals. The next function is `page_writeback_init`. This function initializes the ratio for the dirty pages. Every low-level page entry contains the `dirty` bit which indicates whether a page has been written to after been loaded into memory.
|
|
|
|
Creation of the root for the procfs
|
|
--------------------------------------------------------------------------------
|
|
|
|
After all of this preparations we need to create the root for the [proc](http://en.wikipedia.org/wiki/Procfs) filesystem. We will do it with the call of the `proc_root_init` function from the [fs/proc/root.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/fs/proc/root.c). At the start of the `proc_root_init` function we allocate the cache for the inodes and register a new filesystem in the system with the:
|
|
|
|
```C
|
|
err = register_filesystem(&proc_fs_type);
|
|
if (err)
|
|
return;
|
|
```
|
|
|
|
As I wrote above we will not dive into details about [VFS](http://en.wikipedia.org/wiki/Virtual_file_system) and different filesystems in this chapter, but will see it in the chapter about the `VFS`. After we've registered a new filesystem in our system, we call the `proc_self_init` function from the [fs/proc/self.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/fs/proc/self.c) and this function allocates `inode` number for the `self` (`/proc/self` directory refers to the process accessing the `/proc` filesystem). The next step after the `proc_self_init` is `proc_setup_thread_self` which setups the `/proc/thread-self` directory which contains information about current thread. After this we create `/proc/self/mounts` symlink which will contains mount points with the call of the
|
|
|
|
```C
|
|
proc_symlink("mounts", NULL, "self/mounts");
|
|
```
|
|
|
|
and a couple of directories depends on the different configuration options:
|
|
|
|
```C
|
|
#ifdef CONFIG_SYSVIPC
|
|
proc_mkdir("sysvipc", NULL);
|
|
#endif
|
|
proc_mkdir("fs", NULL);
|
|
proc_mkdir("driver", NULL);
|
|
proc_mkdir("fs/nfsd", NULL);
|
|
#if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
|
|
proc_mkdir("openprom", NULL);
|
|
#endif
|
|
proc_mkdir("bus", NULL);
|
|
...
|
|
...
|
|
...
|
|
if (!proc_mkdir("tty", NULL))
|
|
return;
|
|
proc_mkdir("tty/ldisc", NULL);
|
|
...
|
|
...
|
|
...
|
|
```
|
|
|
|
In the end of the `proc_root_init` we call the `proc_sys_init` function which creates `/proc/sys` directory and initializes the [Sysctl](http://en.wikipedia.org/wiki/Sysctl).
|
|
|
|
It is the end of `start_kernel` function. I did not describe all functions which are called in the `start_kernel`. I skipped them, because they are not important for the generic kernel initialization stuff and depend on only different kernel configurations. They are `taskstats_init_early` which exports per-task statistic to the user-space, `delayacct_init` - initializes per-task delay accounting, `key_init` and `security_init` initialize different security stuff, `check_bugs` - fix some architecture-dependent bugs, `ftrace_init` function executes initialization of the [ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt), `cgroup_init` makes initialization of the rest of the [cgroup](http://en.wikipedia.org/wiki/Cgroups) subsystem,etc. Many of these parts and subsystems will be described in the other chapters.
|
|
|
|
That's all. Finally we have passed through the long-long `start_kernel` function. But it is not the end of the linux kernel initialization process. We haven't run the first process yet. In the end of the `start_kernel` we can see the last call of the - `rest_init` function. Let's go ahead.
|
|
|
|
First steps after the start_kernel
|
|
--------------------------------------------------------------------------------
|
|
|
|
The `rest_init` function is defined in the same source code file as `start_kernel` function, and this file is [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c). In the beginning of the `rest_init` we can see call of the two following functions:
|
|
|
|
```C
|
|
rcu_scheduler_starting();
|
|
smpboot_thread_init();
|
|
```
|
|
|
|
The first `rcu_scheduler_starting` makes [RCU](http://en.wikipedia.org/wiki/Read-copy-update) scheduler active and the second `smpboot_thread_init` registers the `smpboot_thread_notifier` CPU notifier (more about it you can read in the [CPU hotplug documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt). After this we can see the following calls:
|
|
|
|
```C
|
|
kernel_thread(kernel_init, NULL, CLONE_FS);
|
|
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
|
|
```
|
|
|
|
Here the `kernel_thread` function (defined in the [kernel/fork.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/fork.c)) creates new kernel thread.As we can see the `kernel_thread` function takes three arguments:
|
|
|
|
* Function which will be executed in a new thread;
|
|
* Parameter for the `kernel_init` function;
|
|
* Flags.
|
|
|
|
We will not dive into details about `kernel_thread` implementation (we will see it in the chapter which describe scheduler, just need to say that `kernel_thread` invokes [clone](http://www.tutorialspoint.com/unix_system_calls/clone.htm)). Now we only need to know that we create new kernel thread with `kernel_thread` function, parent and child of the thread will use shared information about filesystem and it will start to execute `kernel_init` function. A kernel thread differs from a user thread that it runs in kernel mode. So with these two `kernel_thread` calls we create two new kernel threads with the `PID = 1` for `init` process and `PID = 2` for `kthreadd`. We already know what is `init` process. Let's look on the `kthreadd`. It is a special kernel thread which manages and helps different parts of the kernel to create another kernel thread. We can see it in the output of the `ps` util:
|
|
|
|
```C
|
|
$ ps -ef | grep kthreadd
|
|
root 2 0 0 Jan11 ? 00:00:00 [kthreadd]
|
|
```
|
|
|
|
Let's postpone `kernel_init` and `kthreadd` for now and go ahead in the `rest_init`. In the next step after we have created two new kernel threads we can see the following code:
|
|
|
|
```C
|
|
rcu_read_lock();
|
|
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
|
|
rcu_read_unlock();
|
|
```
|
|
|
|
The first `rcu_read_lock` function marks the beginning of an [RCU](http://en.wikipedia.org/wiki/Read-copy-update) read-side critical section and the `rcu_read_unlock` marks the end of an RCU read-side critical section. We call these functions because we need to protect the `find_task_by_pid_ns`. The `find_task_by_pid_ns` returns pointer to the `task_struct` by the given pid. So, here we are getting the pointer to the `task_struct` for `PID = 2` (we got it after `kthreadd` creation with the `kernel_thread`). In the next step we call `complete` function
|
|
|
|
```C
|
|
complete(&kthreadd_done);
|
|
```
|
|
|
|
and pass address of the `kthreadd_done`. The `kthreadd_done` defined as
|
|
|
|
```C
|
|
static __initdata DECLARE_COMPLETION(kthreadd_done);
|
|
```
|
|
|
|
where `DECLARE_COMPLETION` macro defined as:
|
|
|
|
```C
|
|
#define DECLARE_COMPLETION(work) \
|
|
struct completion work = COMPLETION_INITIALIZER(work)
|
|
```
|
|
|
|
and expands to the definition of the `completion` structure. This structure is defined in the [include/linux/completion.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/completion.h) and presents `completions` concept. Completions is a code synchronization mechanism which provides race-free solution for the threads that must wait for some process to have reached a point or a specific state. Using completions consists of three parts: The first is definition of the `complete` structure and we did it with the `DECLARE_COMPLETION`. The second is call of the `wait_for_completion`. After the call of this function, a thread which called it will not continue to execute and will wait while other thread did not call `complete` function. Note that we call `wait_for_completion` with the `kthreadd_done` in the beginning of the `kernel_init_freeable`:
|
|
|
|
```C
|
|
wait_for_completion(&kthreadd_done);
|
|
```
|
|
|
|
And the last step is to call `complete` function as we saw it above. After this the `kernel_init_freeable` function will not be executed while `kthreadd` thread will not be set. After the `kthreadd` was set, we can see three following functions in the `rest_init`:
|
|
|
|
```C
|
|
init_idle_bootup_task(current);
|
|
schedule_preempt_disabled();
|
|
cpu_startup_entry(CPUHP_ONLINE);
|
|
```
|
|
|
|
The first `init_idle_bootup_task` function from the [kernel/sched/core.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/sched/core.c) sets the Scheduling class for the current process (`idle` class in our case):
|
|
|
|
```C
|
|
void init_idle_bootup_task(struct task_struct *idle)
|
|
{
|
|
idle->sched_class = &idle_sched_class;
|
|
}
|
|
```
|
|
|
|
where `idle` class is a low task priority and tasks can be run only when the processor doesn't have anything to run besides this tasks. The second function `schedule_preempt_disabled` disables preempt in `idle` tasks. And the third function `cpu_startup_entry` is defined in the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/sched/idle.c) and calls `cpu_idle_loop` from the [kernel/sched/idle.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/sched/idle.c). The `cpu_idle_loop` function works as process with `PID = 0` and works in the background. Main purpose of the `cpu_idle_loop` is to consume the idle CPU cycles. When there is no process to run, this process starts to work. We have one process with `idle` scheduling class (we just set the `current` task to the `idle` with the call of the `init_idle_bootup_task` function), so the `idle` thread does not do useful work but just checks if there is an active task to switch to:
|
|
|
|
```C
|
|
static void cpu_idle_loop(void)
|
|
{
|
|
...
|
|
...
|
|
...
|
|
while (1) {
|
|
while (!need_resched()) {
|
|
...
|
|
...
|
|
...
|
|
}
|
|
...
|
|
}
|
|
```
|
|
|
|
More about it will be in the chapter about scheduler. So for this moment the `start_kernel` calls the `rest_init` function which spawns an `init` (`kernel_init` function) process and become `idle` process itself. Now is time to look on the `kernel_init`. Execution of the `kernel_init` function starts from the call of the `kernel_init_freeable` function. The `kernel_init_freeable` function first of all waits for the completion of the `kthreadd` setup. I already wrote about it above:
|
|
|
|
```C
|
|
wait_for_completion(&kthreadd_done);
|
|
```
|
|
|
|
After this we set `gfp_allowed_mask` to `__GFP_BITS_MASK` which means that system is already running, set allowed [cpus/mems](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt) to all CPUs and [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access) nodes with the `set_mems_allowed` function, allow `init` process to run on any CPU with the `set_cpus_allowed_ptr`, set pid for the `cad` or `Ctrl-Alt-Delete`, do preparation for booting of the other CPUs with the call of the `smp_prepare_cpus`, call early [initcalls](http://kernelnewbies.org/Documents/InitcallMechanism) with the `do_pre_smp_initcalls`, initialize `SMP` with the `smp_init` and initialize [lockup_detector](https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt) with the call of the `lockup_detector_init` and initialize scheduler with the `sched_init_smp`.
|
|
|
|
After this we can see the call of the following functions - `do_basic_setup`. Before we will call the `do_basic_setup` function, our kernel already initialized for this moment. As comment says:
|
|
|
|
```
|
|
Now we can finally start doing some real work..
|
|
```
|
|
|
|
The `do_basic_setup` will reinitialize [cpuset](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt) to the active CPUs, initialize the `khelper` - which is a kernel thread which used for making calls out to userspace from within the kernel, initialize [tmpfs](http://en.wikipedia.org/wiki/Tmpfs), initialize `drivers` subsystem, enable the user-mode helper `workqueue` and make post-early call of the `initcalls`. We can see opening of the `dev/console` and dup twice file descriptors from `0` to `2` after the `do_basic_setup`:
|
|
|
|
|
|
```C
|
|
if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
|
|
pr_err("Warning: unable to open an initial console.\n");
|
|
|
|
(void) sys_dup(0);
|
|
(void) sys_dup(0);
|
|
```
|
|
|
|
We are using two system calls here `sys_open` and `sys_dup`. In the next chapters we will see explanation and implementation of the different system calls. After we opened initial console, we check that `rdinit=` option was passed to the kernel command line or set default path of the ramdisk:
|
|
|
|
```C
|
|
if (!ramdisk_execute_command)
|
|
ramdisk_execute_command = "/init";
|
|
```
|
|
|
|
Check user's permissions for the `ramdisk` and call the `prepare_namespace` function from the [init/do_mounts.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/do_mounts.c) which checks and mounts the [initrd](http://en.wikipedia.org/wiki/Initrd):
|
|
|
|
```C
|
|
if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
|
|
ramdisk_execute_command = NULL;
|
|
prepare_namespace();
|
|
}
|
|
```
|
|
|
|
This is the end of the `kernel_init_freeable` function and we need return to the `kernel_init`. The next step after the `kernel_init_freeable` finished its execution is the `async_synchronize_full`. This function waits until all asynchronous function calls have been done and after it we will call the `free_initmem` which will release all memory occupied by the initialization stuff which located between `__init_begin` and `__init_end`. After this we protect `.rodata` with the `mark_rodata_ro` and update state of the system from the `SYSTEM_BOOTING` to the
|
|
|
|
```C
|
|
system_state = SYSTEM_RUNNING;
|
|
```
|
|
|
|
And tries to run the `init` process:
|
|
|
|
```C
|
|
if (ramdisk_execute_command) {
|
|
ret = run_init_process(ramdisk_execute_command);
|
|
if (!ret)
|
|
return 0;
|
|
pr_err("Failed to execute %s (error %d)\n",
|
|
ramdisk_execute_command, ret);
|
|
}
|
|
```
|
|
|
|
First of all it checks the `ramdisk_execute_command` which we set in the `kernel_init_freeable` function and it will be equal to the value of the `rdinit=` kernel command line parameters or `/init` by default. The `run_init_process` function fills the first element of the `argv_init` array:
|
|
|
|
```C
|
|
static const char *argv_init[MAX_INIT_ARGS+2] = { "init", NULL, };
|
|
```
|
|
|
|
which represents arguments of the `init` program and call `do_execve` function:
|
|
|
|
```C
|
|
argv_init[0] = init_filename;
|
|
return do_execve(getname_kernel(init_filename),
|
|
(const char __user *const __user *)argv_init,
|
|
(const char __user *const __user *)envp_init);
|
|
```
|
|
|
|
The `do_execve` function is defined in the [include/linux/sched.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/sched.h) and runs program with the given file name and arguments. If we did not pass `rdinit=` option to the kernel command line, kernel starts to check the `execute_command` which is equal to value of the `init=` kernel command line parameter:
|
|
|
|
```C
|
|
if (execute_command) {
|
|
ret = run_init_process(execute_command);
|
|
if (!ret)
|
|
return 0;
|
|
panic("Requested init %s failed (error %d).",
|
|
execute_command, ret);
|
|
}
|
|
```
|
|
|
|
If we did not pass `init=` kernel command line parameter either, kernel tries to run one of the following executable files:
|
|
|
|
```C
|
|
if (!try_to_run_init_process("/sbin/init") ||
|
|
!try_to_run_init_process("/etc/init") ||
|
|
!try_to_run_init_process("/bin/init") ||
|
|
!try_to_run_init_process("/bin/sh"))
|
|
return 0;
|
|
```
|
|
|
|
Otherwise we finish with [panic](http://en.wikipedia.org/wiki/Kernel_panic):
|
|
|
|
```C
|
|
panic("No working init found. Try passing init= option to kernel. "
|
|
"See Linux Documentation/init.txt for guidance.");
|
|
```
|
|
|
|
That's all! Linux kernel initialization process is finished!
|
|
|
|
Conclusion
|
|
--------------------------------------------------------------------------------
|
|
|
|
It is the end of the tenth part about the linux kernel [initialization process](https://proninyaroslav.gitbooks.io/linux-insides-ru/content/Initialization/index.html). It is not only the `tenth` part, but also is the last part which describes initialization of the linux kernel. As I wrote in the first [part](https://proninyaroslav.gitbooks.io/linux-insides-ru/content/Initialization/linux-initialization-1.html) of this chapter, we will go through all steps of the kernel initialization and we did it. We started at the first architecture-independent function - `start_kernel` and finished with the launch of the first `init` process in the our system. I skipped details about different subsystem of the kernel, for example I almost did not cover scheduler, interrupts, exception handling, etc. From the next part we will start to dive to the different kernel subsystems. Hope it will be interesting.
|
|
|
|
If you have any questions or suggestions write me a comment or ping me at [twitter](https://twitter.com/0xAX).
|
|
|
|
**Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
|
|
|
|
Links
|
|
--------------------------------------------------------------------------------
|
|
|
|
* [SLAB](http://en.wikipedia.org/wiki/Slab_allocation)
|
|
* [xsave](http://www.felixcloutier.com/x86/XSAVES.html)
|
|
* [FPU](http://en.wikipedia.org/wiki/Floating-point_unit)
|
|
* [Documentation/security/credentials.txt](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/security/credentials.txt)
|
|
* [Documentation/x86/x86_64/mm](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/x86/x86_64/mm.txt)
|
|
* [RCU](http://en.wikipedia.org/wiki/Read-copy-update)
|
|
* [VFS](http://en.wikipedia.org/wiki/Virtual_file_system)
|
|
* [inode](http://en.wikipedia.org/wiki/Inode)
|
|
* [proc](http://en.wikipedia.org/wiki/Procfs)
|
|
* [man proc](http://linux.die.net/man/5/proc)
|
|
* [Sysctl](http://en.wikipedia.org/wiki/Sysctl)
|
|
* [ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt)
|
|
* [cgroup](http://en.wikipedia.org/wiki/Cgroups)
|
|
* [CPU hotplug documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
|
|
* [completions - wait for completion handling](https://www.kernel.org/doc/Documentation/scheduler/completion.txt)
|
|
* [NUMA](http://en.wikipedia.org/wiki/Non-uniform_memory_access)
|
|
* [cpus/mems](https://www.kernel.org/doc/Documentation/cgroups/cpusets.txt)
|
|
* [initcalls](http://kernelnewbies.org/Documents/InitcallMechanism)
|
|
* [Tmpfs](http://en.wikipedia.org/wiki/Tmpfs)
|
|
* [initrd](http://en.wikipedia.org/wiki/Initrd)
|
|
* [panic](http://en.wikipedia.org/wiki/Kernel_panic)
|
|
* [Previous part](https://proninyaroslav.gitbooks.io/linux-insides-ru/content/Initialization/linux-initialization-9.html)
|