mirror of
https://github.com/0xAX/linux-insides.git
synced 2024-12-22 22:58:08 +00:00
Merge pull request #363 from olshevskiy87/fix_typos_misc_syscall
fix typos: misc and syscall chapters
This commit is contained in:
commit
3723c112d3
@ -129,7 +129,7 @@ SUBARCH := $(shell uname -m | sed -e s/i.86/x86/ -e s/x86_64/x86/ \
|
||||
-e s/sh[234].*/sh/ -e s/aarch64.*/arm64/ )
|
||||
```
|
||||
|
||||
As you can see, it executes the [uname](https://en.wikipedia.org/wiki/Uname) util that prints information about machine, operating system and architecture. As it gets the output of `uname`, it parses the ouput and assigns the result to the `SUBARCH` variable. Now that we have `SUBARCH`, we set the `SRCARCH` variable that provides the directory of the certain architecture and `hfr-arch` that provides the directory for the header files:
|
||||
As you can see, it executes the [uname](https://en.wikipedia.org/wiki/Uname) util that prints information about machine, operating system and architecture. As it gets the output of `uname`, it parses the output and assigns the result to the `SUBARCH` variable. Now that we have `SUBARCH`, we set the `SRCARCH` variable that provides the directory of the certain architecture and `hfr-arch` that provides the directory for the header files:
|
||||
|
||||
```Makefile
|
||||
ifeq ($(ARCH),i386)
|
||||
|
@ -7,7 +7,7 @@ How does the Linux kernel handle a system call
|
||||
The previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) was the first part of the chapter that describes the [system call](https://en.wikipedia.org/wiki/System_call) concepts in the Linux kernel.
|
||||
In the previous part we learned what a system call is in the Linux kernel, and in operating systems in general. This was introduced from a user-space perspective, and part of the [write](http://man7.org/linux/man-pages/man2/write.2.html) system call implementation was discussed. In this part we continue our look at system calls, starting with some theory before moving onto the Linux kernel code.
|
||||
|
||||
An user application does not make the system call directly from our applications. We did not write the `Hello world!` program like:
|
||||
A user application does not make the system call directly from our applications. We did not write the `Hello world!` program like:
|
||||
|
||||
```C
|
||||
int main(int argc, char **argv)
|
||||
@ -135,16 +135,16 @@ wrmsrl(MSR_STAR, ((u64)__USER32_CS)<<48 | ((u64)__KERNEL_CS)<<32);
|
||||
wrmsrl(MSR_LSTAR, entry_SYSCALL_64);
|
||||
```
|
||||
|
||||
The first model specific register - `MSR_STAR` contains `63:48` bits of the user code segment. These bits will be loaded to the `CS` and `SS` segment registers for the `sysret` instruction which provides functionality to return from a system call to user code with the related privilege. Also the `MSR_STAR` contains `47:32` bits from the kernel code that will be used as the base selector for `CS` and `SS` segment registers when user space applications execute a system call. In the second line of code we fill the `MSR_LSTAR` register with the `entry_SYSCALL_64` symbol that represents system call entry. The `entry_SYSCALL_64` is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and contains code related to the preparation peformed before a system call handler will be executed (I already wrote about these preparations, read above). We will not consider the `entry_SYSCALL_64` now, but will return to it later in this chapter.
|
||||
The first model specific register - `MSR_STAR` contains `63:48` bits of the user code segment. These bits will be loaded to the `CS` and `SS` segment registers for the `sysret` instruction which provides functionality to return from a system call to user code with the related privilege. Also the `MSR_STAR` contains `47:32` bits from the kernel code that will be used as the base selector for `CS` and `SS` segment registers when user space applications execute a system call. In the second line of code we fill the `MSR_LSTAR` register with the `entry_SYSCALL_64` symbol that represents system call entry. The `entry_SYSCALL_64` is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and contains code related to the preparation performed before a system call handler will be executed (I already wrote about these preparations, read above). We will not consider the `entry_SYSCALL_64` now, but will return to it later in this chapter.
|
||||
|
||||
After we have set the entry point for system calls, we need to set the following model specific registers:
|
||||
|
||||
* `MSR_CSTAR` - target `rip` for the compability mode callers;
|
||||
* `MSR_CSTAR` - target `rip` for the compatibility mode callers;
|
||||
* `MSR_IA32_SYSENTER_CS` - target `cs` for the `sysenter` instruction;
|
||||
* `MSR_IA32_SYSENTER_ESP` - target `esp` for the `sysenter` instruction;
|
||||
* `MSR_IA32_SYSENTER_EIP` - target `eip` for the `sysenter` instruction.
|
||||
|
||||
The values of these model specific register depend on the `CONFIG_IA32_EMULATION` kernel configuration option. If this kernel configuration option is enabled, it allows legacy 32-bit programs to run under a 64-bit kernel. In the first case, if the `CONFIG_IA32_EMULATION` kernel configuration option is enabled, we fill these model specific registers with the entry point for the system calls the compability mode:
|
||||
The values of these model specific register depend on the `CONFIG_IA32_EMULATION` kernel configuration option. If this kernel configuration option is enabled, it allows legacy 32-bit programs to run under a 64-bit kernel. In the first case, if the `CONFIG_IA32_EMULATION` kernel configuration option is enabled, we fill these model specific registers with the entry point for the system calls the compatibility mode:
|
||||
|
||||
```C
|
||||
wrmsrl(MSR_CSTAR, entry_SYSCALL_compat);
|
||||
@ -191,7 +191,7 @@ wrmsrl(MSR_SYSCALL_MASK,
|
||||
X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
|
||||
```
|
||||
|
||||
These flags will be cleared during syscall initialization. That's all, it is the end of the `syscall_init` function and it means that system call entry is ready to work. Now we can see what will occur when an user application executes the `syscall` instruction.
|
||||
These flags will be cleared during syscall initialization. That's all, it is the end of the `syscall_init` function and it means that system call entry is ready to work. Now we can see what will occur when a user application executes the `syscall` instruction.
|
||||
|
||||
Preparation before system call handler will be called
|
||||
--------------------------------------------------------------------------------
|
||||
@ -364,7 +364,7 @@ In the end we just call the `USERGS_SYSRET64` macro that expands to the call of
|
||||
sysretq;
|
||||
```
|
||||
|
||||
Now we know what occurs when an user application calls a system call. The full path of this process is as follows:
|
||||
Now we know what occurs when a user application calls a system call. The full path of this process is as follows:
|
||||
|
||||
* User application contains code that fills general purposer register with the values (system call number and arguments of this system call);
|
||||
* Processor switches from the user mode to kernel mode and starts execution of the system call entry - `entry_SYSCALL_64`;
|
||||
|
@ -55,7 +55,7 @@ As we can see, at the beginning of the `map_vsyscall` function we get the physic
|
||||
ffffffff81881000 D __vsyscall_page
|
||||
```
|
||||
|
||||
in the `.data..page_aligned, aw` [section](https://en.wikipedia.org/wiki/Memory_segmentation) and contains call of the three folowing system calls:
|
||||
in the `.data..page_aligned, aw` [section](https://en.wikipedia.org/wiki/Memory_segmentation) and contains call of the three following system calls:
|
||||
|
||||
* `gettimeofday`;
|
||||
* `time`;
|
||||
@ -105,7 +105,7 @@ enum fixed_addresses {
|
||||
...
|
||||
```
|
||||
|
||||
It equal to the `511`. The second argument is the physical address of the the page that has to be mapped and the third argument is the flags of the page. Note that the flags of the `VSYSCALL_PAGE` depend on the `vsyscall_mode` variable. It will be `PAGE_KERNEL_VSYSCALL` if the `vsyscall_mode` variable is `NATIVE` and the `PAGE_KERNEL_VVAR` otherwise. Both macros (the `PAGE_KERNEL_VSYSCALL` and the `PAGE_KERNEL_VVAR`) will be expanded to the following flags:
|
||||
It equal to the `511`. The second argument is the physical address of the page that has to be mapped and the third argument is the flags of the page. Note that the flags of the `VSYSCALL_PAGE` depend on the `vsyscall_mode` variable. It will be `PAGE_KERNEL_VSYSCALL` if the `vsyscall_mode` variable is `NATIVE` and the `PAGE_KERNEL_VVAR` otherwise. Both macros (the `PAGE_KERNEL_VSYSCALL` and the `PAGE_KERNEL_VVAR`) will be expanded to the following flags:
|
||||
|
||||
```C
|
||||
#define __PAGE_KERNEL_VSYSCALL (__PAGE_KERNEL_RX | _PAGE_USER)
|
||||
@ -168,7 +168,7 @@ __vsyscall_page:
|
||||
ret
|
||||
```
|
||||
|
||||
And the start address of the `vsyscall` page is the `ffffffffff600000` everytime. So, the [glibc](https://en.wikipedia.org/wiki/GNU_C_Library) knows the addresses of the all virutal system call handlers. You can find definition of these addresses in the `glibc` source code:
|
||||
And the start address of the `vsyscall` page is the `ffffffffff600000` everytime. So, the [glibc](https://en.wikipedia.org/wiki/GNU_C_Library) knows the addresses of the all virtual system call handlers. You can find definition of these addresses in the `glibc` source code:
|
||||
|
||||
```C
|
||||
#define VSYSCALL_ADDR_vgettimeofday 0xffffffffff600000
|
||||
@ -180,7 +180,7 @@ All virtual system call requests will fall into the `__vsyscall_page` + `VSYSCAL
|
||||
|
||||
In the second case, if we pass `vsyscall=emulate` parameter to the kernel command line, an attempt to perform virtual system call handler will cause a [page fault](https://en.wikipedia.org/wiki/Page_fault) exception. Of course, remember, the `vsyscall` page has `__PAGE_KERNEL_VVAR` access rights that forbid execution. The `do_page_fault` function is the `#PF` or page fault handler. It tries to understand the reason of the last page fault. And one of the reason can be situation when virtual system call called and `vsyscall` mode is `emulate`. In this case `vsyscall` will be handled by the `emulate_vsyscall` function that defined in the [arch/x86/entry/vsyscall/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_64.c) source code file.
|
||||
|
||||
The `emulate_vsyscall` function gets the number of a virtual system call, checks it, prints error and sends [segementation fault](https://en.wikipedia.org/wiki/Segmentation_fault) single:
|
||||
The `emulate_vsyscall` function gets the number of a virtual system call, checks it, prints error and sends [segmentation fault](https://en.wikipedia.org/wiki/Segmentation_fault) single:
|
||||
|
||||
```C
|
||||
...
|
||||
@ -284,7 +284,7 @@ extern const struct vdso_image vdso_image_x32;
|
||||
#endif
|
||||
```
|
||||
|
||||
If our kernel is configured for the `x86` architecture or for the `x86_64` and compability mode, we will have ability to call a system call with the `int 0x80` interrupt, if compability mode is enabled, we will be able to call a system call with the native `syscall instruction` or `sysenter` instruction in other way:
|
||||
If our kernel is configured for the `x86` architecture or for the `x86_64` and compatibility mode, we will have ability to call a system call with the `int 0x80` interrupt, if compatibility mode is enabled, we will be able to call a system call with the native `syscall instruction` or `sysenter` instruction in other way:
|
||||
|
||||
```C
|
||||
#if defined CONFIG_X86_32 || defined CONFIG_COMPAT
|
||||
@ -395,7 +395,7 @@ Links
|
||||
* [BUILD_BUG_ON](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html)
|
||||
* [Processor register](https://en.wikipedia.org/wiki/Processor_register)
|
||||
* [Page fault](https://en.wikipedia.org/wiki/Page_fault)
|
||||
* [segementation fault](https://en.wikipedia.org/wiki/Segmentation_fault)
|
||||
* [segmentation fault](https://en.wikipedia.org/wiki/Segmentation_fault)
|
||||
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
|
||||
* [stack pointer](https://en.wikipedia.org/wiki/Stack_register)
|
||||
* [uname](https://en.wikipedia.org/wiki/Uname)
|
||||
|
@ -16,7 +16,7 @@ This part will be last part in this chapter and as you can understand from the p
|
||||
how do we launch our programs?
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
There are many different ways to launch an application from an user perspective. For example we can run a program from the [shell](https://en.wikipedia.org/wiki/Unix_shell) or double-click on the application icon. It does not matter. The Linux kernel handles application launch regardless how we do launch this application.
|
||||
There are many different ways to launch an application from a user perspective. For example we can run a program from the [shell](https://en.wikipedia.org/wiki/Unix_shell) or double-click on the application icon. It does not matter. The Linux kernel handles application launch regardless how we do launch this application.
|
||||
|
||||
In this part we will consider the way when we just launch an application from the shell. As you know, the standard way to launch an application from shell is the following: We just launch a [terminal emulator](https://en.wikipedia.org/wiki/Terminal_emulator) application and just write the name of the program and pass or not arguments to our program, for example:
|
||||
|
||||
@ -68,12 +68,12 @@ $ strace uname
|
||||
execve("/bin/uname", ["uname"], [/* 62 vars */]) = 0
|
||||
```
|
||||
|
||||
So, an user application (`bash` in our case) calls the system call and as we already know the next step is Linux kernel.
|
||||
So, a user application (`bash` in our case) calls the system call and as we already know the next step is Linux kernel.
|
||||
|
||||
execve system call
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
We saw preparation before a system call called by an user application and after a system call handler finished its work in the second [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) of this chapter. We stopped at the call of the `execve` system call in the previous paragraph. This system call defined in the [fs/exec.c](https://github.com/torvalds/linux/blob/master/fs/exec.c) source code file and as we already know it takes three arguments:
|
||||
We saw preparation before a system call called by a user application and after a system call handler finished its work in the second [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) of this chapter. We stopped at the call of the `execve` system call in the previous paragraph. This system call defined in the [fs/exec.c](https://github.com/torvalds/linux/blob/master/fs/exec.c) source code file and as we already know it takes three arguments:
|
||||
|
||||
```
|
||||
SYSCALL_DEFINE3(execve,
|
||||
@ -161,7 +161,7 @@ The `sched_exec` function is used to determine the least loaded processor that c
|
||||
|
||||
After this we need to check [file descriptor](https://en.wikipedia.org/wiki/File_descriptor) of the give executable binary. We try to check does the name of the our binary file starts from the `/` symbol or does the path of the given executable binary is interpreted relative to the current working directory of the calling process or in other words file descriptor is `AT_FDCWD` (read above about this).
|
||||
|
||||
If one of these checks is successfull we set the binary parameter filename:
|
||||
If one of these checks is successful we set the binary parameter filename:
|
||||
|
||||
```C
|
||||
bprm->file = file;
|
||||
@ -227,7 +227,7 @@ if (retval < 0)
|
||||
goto out;
|
||||
```
|
||||
|
||||
fills the `linux_binprm` structure with the `uid` from [inode](https://en.wikipedia.org/wiki/Inode) and read `128` bytes from the binary executable file. We read only first `128` from the executable file because we need to check a type of our executable. We will read the rest of the executable file in the later step. After the preparation of the `linux_bprm` structure we copy the filename of the executable binary file, command line arguments and enviroment variables to the `linux_bprm` with the call of the `copy_strings_kernel` function:
|
||||
fills the `linux_binprm` structure with the `uid` from [inode](https://en.wikipedia.org/wiki/Inode) and read `128` bytes from the binary executable file. We read only first `128` from the executable file because we need to check a type of our executable. We will read the rest of the executable file in the later step. After the preparation of the `linux_bprm` structure we copy the filename of the executable binary file, command line arguments and environment variables to the `linux_bprm` with the call of the `copy_strings_kernel` function:
|
||||
|
||||
```C
|
||||
retval = copy_strings_kernel(1, &bprm->filename, bprm);
|
||||
@ -249,7 +249,7 @@ And set the pointer to the top of new program's stack that we set in the `bprm_m
|
||||
bprm->exec = bprm->p;
|
||||
```
|
||||
|
||||
The top of the stack will contain the program filename and we store this fileneme tothe `exec` field of the `linux_bprm` structure.
|
||||
The top of the stack will contain the program filename and we store this filename to the `exec` field of the `linux_bprm` structure.
|
||||
|
||||
Now we have filled `linux_bprm` structure, we call the `exec_binprm` function:
|
||||
|
||||
@ -277,7 +277,7 @@ search_binary_handler(bprm);
|
||||
function. This function goes through the list of handlers that contains different binary formats. Currently the Linux kernel supports following binary formats:
|
||||
|
||||
* `binfmt_script` - support for interpreted scripts that are starts from the [#!](https://en.wikipedia.org/wiki/Shebang_%28Unix%29) line;
|
||||
* `binfmt_misc` - support differnt binary formats, according to runtime configuration of the Linux kernel;
|
||||
* `binfmt_misc` - support different binary formats, according to runtime configuration of the Linux kernel;
|
||||
* `binfmt_elf` - support [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) format;
|
||||
* `binfmt_aout` - support [a.out](https://en.wikipedia.org/wiki/A.out) format;
|
||||
* `binfmt_flat` - support for [flat](https://en.wikipedia.org/wiki/Binary_file#Structure) format;
|
||||
@ -394,7 +394,7 @@ That's all. From this point our programm will be executed.
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the fourth and last part of the about the system calls concept in the Linux kernel. We saw almost all related stuff to the `system call` concept in these four parts. We started from the understanding of the `system call` concept, we have learned what is it and why do users applications need in this concept. Next we saw how does the Linux handle a system call from an user application. We met two similar concepts to the `system call` concept, they are `vsyscall` and `vDSO` and finally we saw how does Linux kernel run an user program.
|
||||
This is the end of the fourth and last part of the about the system calls concept in the Linux kernel. We saw almost all related stuff to the `system call` concept in these four parts. We started from the understanding of the `system call` concept, we have learned what is it and why do users applications need in this concept. Next we saw how does the Linux handle a system call from a user application. We met two similar concepts to the `system call` concept, they are `vsyscall` and `vDSO` and finally we saw how does Linux kernel run a user program.
|
||||
|
||||
If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-insides/issues/new).
|
||||
|
||||
|
Loading…
Reference in New Issue
Block a user