mirror of https://github.com/0xAX/linux-insides.git synced 2025-01-20 12:41:09 +00:00

Merge pull request #11 from 0xAX/master

merge commits
慕冬亮 2015-09-16 14:44:11 +08:00
commit 58f930ebb8
9 changed files with 499 additions and 59 deletions

@ -206,7 +206,7 @@ hdr:
boot_flag: .word 0xAA55
```
The bootloader must fill this and the rest of the headers (those marked as type `write` in the Linux boot protocol, such as in [this example](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L354)) with values that it either received from the command line or calculated. We will not go over a description and explanation of every field of the kernel setup header here; we will get back to them when the kernel uses them. You can find a description of all fields in the [boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt#L156).
As we can see in the kernel boot protocol, the memory map will be the following after loading the kernel:

@ -301,7 +301,7 @@ call __libc_start_main
Here we pass the addresses of the entry points to the `.init` and `.fini` sections, which contain the code that executes when the program starts and the code that executes when the program terminates. At the end we see the call of the `main` function from our program. These three symbols are defined in the [csu/elf-init.c](https://sourceware.org/git/?p=glibc.git;a=blob;f=csu/elf-init.c;hb=1d4bbc54bd4f7d85d774871341b49f4357af1fb7) source code file. The following two object files:
* `crtn.o`;
* `crti.o`.
define the function prologues/epilogues for the `.init` and `.fini` sections (with the `_init` and `_fini` symbols respectively).

@ -14,6 +14,12 @@ Support
[![Flattr linux-insides](https://img.shields.io/badge/donate-flattr-green.svg)](https://flattr.com/submit/auto?user_id=0xAX&url=https://github.com/0xAX/linux-insides/&title=linux-insed) [![Support at gratipay](http://img.shields.io/gratipay/0xAX.svg)](https://gratipay.com/0xAX/) [![Support with bitcoin](https://img.shields.io/badge/donate-bitcoin-green.svg)](https://www.coinbase.com/checkouts/0bfa452a41cf52c0b3f99500b4f31685) [![Support via gitbook](https://img.shields.io/badge/donate-gitbook-green.svg)](https://gumroad.com/l/gitbook_54c9232c1db1670300055523?wanted=true)
In other languages
-------------------
* [Chinese](https://github.com/xinqiu/linux-insides)
* [Spanish](https://github.com/leolas95/linux-insides)
LICENSE
-------------

@ -28,13 +28,14 @@
    * [Initialization of external hardware interrupts structures](interrupts/interrupts-8.md)
    * [Softirq, Tasklets and Workqueues](interrupts/interrupts-9.md)
    * [Last part](interrupts/interrupts-10.md)
* [System calls](SysCall/README.md)
    * [Introduction to system calls](SysCall/syscall-1.md)
    * [How the Linux kernel handles a system call](SysCall/syscall-2.md)
    * [vsyscall and vDSO](SysCall/syscall-3.md)
    * [How the Linux kernel runs a program](SysCall/syscall-4.md)
* [Memory management](mm/README.md)
    * [Memblock](mm/linux-mm-1.md)
    * [Fixmaps and ioremap](mm/linux-mm-2.md)
* [SMP]()
* [Concepts](Concepts/README.md)
    * [Per-CPU variables](Concepts/per-cpu.md)

@ -6,3 +6,4 @@ couple of posts which describe the full cycle of the kernel loading process:
* [Introduction to system call concept](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) - this part is an introduction to the `system call` concept in the Linux kernel.
* [How the Linux kernel handles a system call](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) - this part describes how the Linux kernel handles a system call from a userspace application.
* [vsyscall and vDSO](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) - the third part describes the `vsyscall` and `vDSO` concepts.
* [How the Linux kernel runs a program](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html) - this part describes the startup process of a program.

@ -296,10 +296,10 @@ The second macro `__SYSCALL_DEFINEx` expands to the definition of the five follo
static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__))
```
The first `sys##name` is the definition of the syscall handler function with the given name - `sys_system_call_name`. The `__SC_DECL` macro takes the `__VA_ARGS__` and combines each input parameter's type with the parameter name, because the macro definition is unable to determine the parameter types on its own. The `__MAP` macro applies the `__SC_DECL` macro to the `__VA_ARGS__` arguments. The other functions generated by the `__SYSCALL_DEFINEx` macro are needed to protect against [CVE-2009-0029](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-0029), and we will not dive into the details here. Ok, as a result of the `SYSCALL_DEFINE3` macro, we will have:
```C
asmlinkage long sys_write(unsigned int fd, const char __user * buf, size_t count);
```
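To make the token pasting easier to follow, here is a minimal userspace sketch. It is not the kernel's macro: `TOY_SYSCALL_DEFINE3` and `toy_sys_write` are hypothetical names invented for this illustration, the `asmlinkage` and `__user` annotations are dropped, and the body only returns `count`. It merely shows how pasting a name onto a prefix and pairing each type with its parameter name can yield a declaration of the same shape as the one for `sys_write`.

```C
#include <stddef.h>

/*
 * Hypothetical, simplified stand-in for SYSCALL_DEFINE3: pastes the
 * syscall name onto a prefix and pairs each type with its parameter name.
 */
#define TOY_SYSCALL_DEFINE3(name, t1, a1, t2, a2, t3, a3) \
	long toy_sys_##name(t1 a1, t2 a2, t3 a3)

/* Expands to: long toy_sys_write(unsigned int fd, const char *buf, size_t count) */
TOY_SYSCALL_DEFINE3(write, unsigned int, fd, const char *, buf, size_t, count)
{
	(void)fd;            /* unused in this sketch */
	(void)buf;           /* unused in this sketch */
	return (long)count;  /* pretend that every byte was written */
}
```

Calling `toy_sys_write(1, "hi", 2)` from a test program should simply return `2`; the point is the generated signature, not the behaviour.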
Now we know a little about the system call's definition and we can go back to the implementation of the `write` system call. Let's look at the implementation of this system call again:

@ -11,7 +11,7 @@ We already know what is a `system call`. This is special routine in the Linux ke
Introduction to vsyscalls
--------------------------------------------------------------------------------
The `vsyscall` or `virtual system call` is the first and oldest mechanism in the Linux kernel that is designed to accelerate the execution of certain system calls. The principle behind the `vsyscall` concept is simple: the Linux kernel maps into user space a page that contains some variables and the implementation of some system calls. We can find information about this memory space in the Linux kernel [documentation](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt) for the [x86_64](https://en.wikipedia.org/wiki/X86-64):
```
ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
@ -24,7 +24,7 @@ or:
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
```
After this, these system calls will be executed in userspace, which means that there will be no [context switching](https://en.wikipedia.org/wiki/Context_switch). Mapping of the `vsyscall` page occurs in the `map_vsyscall` function that is defined in the [arch/x86/entry/vsyscall/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_64.c) source code file. This function is called during the Linux kernel initialization in the `setup_arch` function that is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file (we saw this function in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) of the Linux kernel initialization process chapter).
Note that the implementation of the `map_vsyscall` function depends on the `CONFIG_X86_VSYSCALL_EMULATION` kernel configuration option:
@ -36,7 +36,7 @@ static inline void map_vsyscall(void) {}
#endif
```
As we can read in the help text, the `CONFIG_X86_VSYSCALL_EMULATION` configuration option means `Enable vsyscall emulation`. Why emulate `vsyscall`? Actually, `vsyscall` is a legacy [ABI](https://en.wikipedia.org/wiki/Application_binary_interface) for security reasons. Virtual system calls have fixed addresses, meaning that the `vsyscall` page is at the same location every time, and the location of this page is determined in the `map_vsyscall` function. Let's look at the implementation of this function:
```C
void __init map_vsyscall(void)
@ -49,7 +49,7 @@ void __init map_vsyscall(void)
}
```
As we can see, at the beginning of the `map_vsyscall` function we get the physical address of the `vsyscall` page with the `__pa_symbol` macro (we already saw the implementation of this macro in the fourth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-4.html) of the Linux kernel initialization process). The `__vsyscall_page` symbol is defined in the [arch/x86/entry/vsyscall/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_emu_64.S) assembly source code file and has the following [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space):
```
ffffffff81881000 D __vsyscall_page
@ -80,7 +80,7 @@ __vsyscall_page:
ret
```
Let's go back to the implementation of the `map_vsyscall` function; we will return to the implementation of the `__vsyscall_page` later. After we get the physical address of the `__vsyscall_page`, we check the value of the `vsyscall_mode` variable and set the [fix-mapped](http://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) address for the `vsyscall` page with the `__set_fixmap` macro:
```C
if (vsyscall_mode != NONE)
@ -105,14 +105,14 @@ enum fixed_addresses {
...
```
It is equal to `511`. The second argument is the physical address of the page that has to be mapped and the third argument is the flags of the page. Note that the flags of the `VSYSCALL_PAGE` depend on the `vsyscall_mode` variable. It will be `PAGE_KERNEL_VSYSCALL` if the `vsyscall_mode` variable is `NATIVE` and `PAGE_KERNEL_VVAR` otherwise. Both macros (the `PAGE_KERNEL_VSYSCALL` and the `PAGE_KERNEL_VVAR`) will be expanded to the following flags:
```C
#define __PAGE_KERNEL_VSYSCALL (__PAGE_KERNEL_RX | _PAGE_USER)
#define __PAGE_KERNEL_VVAR (__PAGE_KERNEL_RO | _PAGE_USER)
```
These flags represent the access rights to the `vsyscall` page. Both include the `_PAGE_USER` flag, which means that the page can be accessed by a user-mode process running at lower privilege levels. The other part depends on the value of the `vsyscall_mode` variable. The first flag (`__PAGE_KERNEL_VSYSCALL`) will be set when `vsyscall_mode` is `NATIVE`, which means that virtual system calls will be native `syscall` instructions. Otherwise, when the `vsyscall_mode` variable is `emulate`, the page will have `PAGE_KERNEL_VVAR`; in this case virtual system calls are turned into traps and emulated. The `vsyscall_mode` variable gets its value in the `vsyscall_setup` function:
```C
static int __init vsyscall_setup(char *str)
@ -149,7 +149,7 @@ BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
(unsigned long)VSYSCALL_ADDR);
```
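The arithmetic behind this check, and behind the index `511` mentioned earlier, can be reproduced in userspace. The following is only a sketch under stated assumptions: 4 KB pages, a 2 MB PMD size, and a fixmap top computed as the `vsyscall` address plus one page, rounded up to a PMD boundary, minus one page (my reading of the `x86_64` fixmap header of that era, not something shown in this part). All `sketch_`/`SKETCH_` names are hypothetical and do not exist in the kernel.

```C
#include <stdint.h>

/* Assumed constants, see the paragraph above; not copied from kernel headers. */
#define SKETCH_PAGE_SHIFT    12
#define SKETCH_PAGE_SIZE     (UINT64_C(1) << SKETCH_PAGE_SHIFT)
#define SKETCH_PMD_SIZE      (UINT64_C(1) << 21)
#define SKETCH_VSYSCALL_ADDR UINT64_C(0xffffffffff600000)

/* Round x up to the next multiple of align (align must be a power of two). */
static uint64_t sketch_round_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Assumed top of the fixmap area. */
static uint64_t sketch_fixaddr_top(void)
{
	return sketch_round_up(SKETCH_VSYSCALL_ADDR + SKETCH_PAGE_SIZE,
			       SKETCH_PMD_SIZE) - SKETCH_PAGE_SIZE;
}

/* Fixmap index of the vsyscall page: distance from the top, in pages. */
static uint64_t sketch_vsyscall_page_index(void)
{
	return (sketch_fixaddr_top() - SKETCH_VSYSCALL_ADDR) >> SKETCH_PAGE_SHIFT;
}

/* Fixmap indices grow downwards from the top of the area. */
static uint64_t sketch_fix_to_virt(uint64_t idx)
{
	return sketch_fixaddr_top() - (idx << SKETCH_PAGE_SHIFT);
}
```

Under these assumptions the index works out to `511`, and converting that index back gives `0xffffffffff600000`, which is the equality that the `BUILD_BUG_ON` enforces at compile time.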
That's all. The `vsyscall` page is set up. The result of all the above is the following: if we pass the `vsyscall=native` parameter to the kernel command line, virtual system calls will be handled as native `syscall` instructions in the [arch/x86/entry/vsyscall/vsyscall_emu_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_emu_64.S). The [glibc](https://en.wikipedia.org/wiki/GNU_C_Library) knows the addresses of the virtual system call handlers. Note that virtual system call handlers are aligned by `1024` (or `0x400`) bytes:
```assembly
__vsyscall_page:
@ -168,7 +168,7 @@ __vsyscall_page:
ret
```
And the start address of the `vsyscall` page is always `ffffffffff600000`. So, the [glibc](https://en.wikipedia.org/wiki/GNU_C_Library) knows the addresses of all the virtual system call handlers. You can find the definitions of these addresses in the `glibc` source code:
```C
#define VSYSCALL_ADDR_vgettimeofday 0xffffffffff600000
@ -178,7 +178,7 @@ And the start address of the `vsyscall` page is the `ffffffffff600000` everytime
All virtual system call requests will fall into the `__vsyscall_page` + `VSYSCALL_ADDR_vsyscall_name` offset, where the number of the virtual system call is put into the `rax` general purpose [register](https://en.wikipedia.org/wiki/Processor_register) and the native x86_64 `syscall` instruction is executed.
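Since the page always starts at `ffffffffff600000`, the handlers are aligned by `0x400` bytes, and `vsyscall` provides only three calls, the handler addresses can be derived with simple arithmetic. The sketch below is an illustration only: the `sketch_`/`SKETCH_` names are hypothetical, and returning `0` or `-1` for invalid input is a convention chosen for this example, not taken from the kernel or `glibc`.

```C
#include <stdint.h>

#define SKETCH_VSYSCALL_ADDR   UINT64_C(0xffffffffff600000) /* fixed page start */
#define SKETCH_VSYSCALL_STRIDE UINT64_C(0x400)              /* handler alignment */
#define SKETCH_VSYSCALL_COUNT  3                            /* number of vsyscalls */

/* Address of the nr-th handler slot, or 0 if nr is out of range. */
static uint64_t sketch_vsyscall_nr_to_addr(unsigned int nr)
{
	if (nr >= SKETCH_VSYSCALL_COUNT)
		return 0;
	return SKETCH_VSYSCALL_ADDR + nr * SKETCH_VSYSCALL_STRIDE;
}

/* Slot number for an address, or -1 if it is not a handler entry point. */
static int sketch_vsyscall_addr_to_nr(uint64_t addr)
{
	/* Wraps to a huge value when addr is below the page; rejected below. */
	uint64_t off = addr - SKETCH_VSYSCALL_ADDR;

	if (off % SKETCH_VSYSCALL_STRIDE != 0)
		return -1;
	if (off / SKETCH_VSYSCALL_STRIDE >= SKETCH_VSYSCALL_COUNT)
		return -1;
	return (int)(off / SKETCH_VSYSCALL_STRIDE);
}
```

Slot `0` corresponds to the `VSYSCALL_ADDR_vgettimeofday` address from the `glibc` listing; the next two slots follow at `0x400` and `0x800` bytes from the start of the page.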
In the second case, if we pass the `vsyscall=emulate` parameter to the kernel command line, an attempt to execute a virtual system call handler will cause a [page fault](https://en.wikipedia.org/wiki/Page_fault) exception. Remember, the `vsyscall` page has `__PAGE_KERNEL_VVAR` access rights that forbid execution. The `do_page_fault` function is the `#PF` or page fault handler. It tries to understand the reason for the last page fault, and one of the reasons can be that a virtual system call was called while the `vsyscall` mode is `emulate`. In this case `vsyscall` will be handled by the `emulate_vsyscall` function that is defined in the [arch/x86/entry/vsyscall/vsyscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vsyscall/vsyscall_64.c) source code file.
The `emulate_vsyscall` function gets the number of a virtual system call, checks it, prints an error and sends a [segmentation fault](https://en.wikipedia.org/wiki/Segmentation_fault) signal:
@ -230,7 +230,7 @@ That's all. Now let's look on the modern concept - `vDSO`.
Introduction to vDSO
--------------------------------------------------------------------------------
As I already wrote above, `vsyscall` is an obsolete concept and has been replaced by the `vDSO` or `virtual dynamic shared object`. The main difference between the `vsyscall` and `vDSO` mechanisms is that `vDSO` maps memory pages into each process in a shared object [form](https://en.wikipedia.org/wiki/Library_%28computing%29#Shared_libraries), but `vsyscall` is static in memory and has the same address every time. For the `x86_64` architecture it is called `linux-vdso.so.1`. All userspace applications are linked with this shared library via the `glibc`. For example:
```
~$ ldd /bin/uname
@ -266,7 +266,7 @@ static int __init init_vdso(void)
#endif
```
Both functions initialize the `vdso_image` structure. This structure is defined in the two generated source code files: the [arch/x86/entry/vdso/vdso-image-64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vdso-image-64.c) and the [arch/x86/entry/vdso/vdso-image-64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vdso-image-64.c). These source code files are generated by the [vdso2c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vdso2c.c) program from the different source code files that represent different approaches to calling a system call, like `int 0x80`, `sysenter`, etc. The full set of the images depends on the kernel configuration.
For example, for the `x86_64` Linux kernel it will contain `vdso_image_64`:
@ -284,7 +284,7 @@ extern const struct vdso_image vdso_image_x32;
#endif
```
If our kernel is configured for the `x86` architecture, or for the `x86_64` with compatibility mode, we will be able to call a system call with the `int 0x80` interrupt; if compatibility mode is enabled, we will be able to call a system call with the native `syscall` instruction, or with the `sysenter` instruction otherwise:
```C
#if defined CONFIG_X86_32 || defined CONFIG_COMPAT
@ -296,7 +296,7 @@ If our kernel will configured for the `x86` architecture or for the `x86_64` and
#endif
```
As we can understand from the name of the `vdso_image` structure, it represents the image of the `vDSO` for a certain mode of system call entry. This structure contains information about the size in bytes of the `vDSO` area, which is always a multiple of `PAGE_SIZE` (`4096` bytes), a pointer to the text mapping, the start and end addresses of the `alternatives` (a set of instructions with better alternatives for a certain type of processor), etc. For example, `vdso_image_64` looks like this:
```C
const struct vdso_image vdso_image_64 = {
@ -322,7 +322,7 @@ static struct page *pages[2];
or 8 Kilobytes.
The `init_vdso_image` function is defined in the same source code file and just initializes the `vdso_image.text_mapping.pages`. First of all, this function calculates the number of pages and initializes each `vdso_image.text_mapping.pages[number_of_page]` with the `virt_to_page` macro, which converts the given address to the `page` structure:
```C
void __init init_vdso_image(const struct vdso_image *image)
@ -339,13 +339,13 @@ void __init init_vdso_image(const struct vdso_image *image)
}
```
The `init_vdso` function is passed to the `subsys_initcall` macro, which adds the given function to the `initcalls` list. All functions from this list will be called in the `do_initcalls` function from the [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) source code file:
```C
subsys_initcall(init_vdso);
```
Ok, we just saw the initialization of the `vDSO` and the initialization of the `page` structures that are related to the memory pages that contain `vDSO` system calls. But where are these pages mapped? Actually, they are mapped by the kernel when it loads a binary into memory. The Linux kernel calls the `arch_setup_additional_pages` function from the [arch/x86/entry/vdso/vma.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/vdso/vma.c) source code file, which checks that `vDSO` is enabled for the `x86_64` and calls the `map_vdso` function:
```C
int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
@ -357,25 +357,24 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
}
```
The `map_vdso` function is defined in the same source code file and maps pages for the `vDSO` and for the shared `vDSO` variables. That's all. The main difference between the `vsyscall` and the `vDSO` concepts is that `vsyscall` has a static address of `ffffffffff600000` and implements `3` system calls, whereas the `vDSO` loads dynamically and implements four system calls:
* `__vdso_clock_gettime`;
* `__vdso_getcpu`;
* `__vdso_gettimeofday`;
* `__vdso_time`.
when `vsyscall` only `3`.
That's all. That's all.
Conclusion Conclusion
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the end of the third part about the system calls concept in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) we implementation of the preparation from the Linux kernel side, before a system call will be handled and implementation of the `exit` process from a system call handler. In this part we continued to dive into the stuff which is related to the system call concept and learned to knew two concepts that are very similar to the system call - the `vsyscall` and the `vDSO`. This is the end of the third part about the system calls concept in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) we discussed the implementation of the preparation from the Linux kernel side, before a system call will be handled and implementation of the `exit` process from a system call handler. In this part we continued to dive into the stuff which is related to the system call concept and learned two new concepts that are very similar to the system call - the `vsyscall` and the `vDSO`.
After all of these three parts, we know almost all things that are related to system calls, we know what is it system call and why do user applications need in they, what do occur when an user application calls system call and what does kernel handles system calls. After all of these three parts, we know almost all things that are related to system calls, we know what system call is and why user applications need them. We also know what occurs when a user application calls a system call and how the kernel handles system calls.
The next part will be last part in this [chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) and we will see what occurs when a user runs the program. The next part will be the last part in this [chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) and we will see what occurs when a user runs the program.
If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-internals/issues/new). If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-internals/issues/new).

430
SysCall/syscall-4.md Normal file

@ -0,0 +1,430 @@
System calls in the Linux kernel. Part 4.
================================================================================
How does the Linux kernel run a program
--------------------------------------------------------------------------------
This is the fourth part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/index.html) that describes [system calls](https://en.wikipedia.org/wiki/System_call) in the Linux kernel and, as I wrote in the conclusion of the [previous](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html) part, it will be the last one in this chapter. In the previous part we stopped at two new concepts:
* `vsyscall`;
* `vDSO`;
that are related and very similar to the system call concept.
As you can understand from the part's title, we will see what occurs in the Linux kernel when we run our programs. So, let's start.
How do we launch our programs?
--------------------------------------------------------------------------------
There are many different ways to launch an application from a user's perspective. For example, we can run a program from the [shell](https://en.wikipedia.org/wiki/Unix_shell) or double-click on the application icon. It does not matter. The Linux kernel handles the application launch regardless of how we launch the application.
In this part we will consider the case when we launch an application from the shell. As you know, the standard way to do this is to start a [terminal emulator](https://en.wikipedia.org/wiki/Terminal_emulator), type the name of the program and optionally pass arguments to it, for example:
![ls shell](http://s14.postimg.org/d6jgidc7l/Screenshot_from_2015_09_07_17_31_55.png)
Let's consider what occurs when we launch an application from the shell: what the shell does when we type a program name, what the Linux kernel does, etc. But before we start to consider these interesting things, I want to warn you that this book is about the Linux kernel. That's why in this part we will mostly see things related to Linux kernel internals. We will not consider in detail what the shell does and we will not consider complex cases like subshells.
My default shell is [bash](https://en.wikipedia.org/wiki/Bash_%28Unix_shell%29), so I will consider how the bash shell launches a program. So let's start. The `bash` shell, like any program written in the [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) programming language, starts from the [main](https://en.wikipedia.org/wiki/Entry_point) function. If you look at the source code of the `bash` shell, you will find the `main` function in the [shell.c](https://github.com/bminor/bash/blob/master/shell.c#L357) source code file. This function does many different things before the main loop of `bash` starts to work. For example this function:
* checks and tries to open `/dev/tty`;
* checks whether the shell is running in debug mode;
* parses command line arguments;
* reads shell environment;
* loads `.bashrc`, `.profile` and other configuration files;
* and many many more.
After all of these operations we can see the call of the `reader_loop` function. This function is defined in the [eval.c](https://github.com/bminor/bash/blob/master/eval.c#L67) source code file and represents the main loop, or in other words it reads and executes commands. After the `reader_loop` function has made all checks and read the given program name and arguments, it calls the `execute_command` function from the [execute_cmd.c](https://github.com/bminor/bash/blob/master/execute_cmd.c#L378) source code file. The `execute_command` function, through the following chain of function calls:
```
execute_command
--> execute_command_internal
----> execute_simple_command
------> execute_disk_command
--------> shell_execve
```
makes different checks, for example whether we need to start a `subshell`, whether the command is a `bash` builtin function, etc. As I already wrote above, we will not consider the details of things that are not related to the Linux kernel. At the end of this process, the `shell_execve` function calls the `execve` system call:
```C
execve (command, args, env);
```
The `execve` system call has the following signature:
```
int execve(const char *filename, char *const argv [], char *const envp[]);
```
and executes a program by the given filename, with the given arguments and [environment variables](https://en.wikipedia.org/wiki/Environment_variable). This system call is the first one that we see in the `strace` output of a program, for example:
```
$ strace ls
execve("/bin/ls", ["ls"], [/* 62 vars */]) = 0
$ strace echo
execve("/bin/echo", ["echo"], [/* 62 vars */]) = 0
$ strace uname
execve("/bin/uname", ["uname"], [/* 62 vars */]) = 0
```
So, a user application (`bash` in our case) calls the system call and, as we already know, the next step is the Linux kernel.
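The user-space side of this step can be illustrated with a minimal sketch. It is not taken from the `bash` sources: `run_program` is a hypothetical helper that only shows the usual `fork`, `execve` and `waitpid` pattern that a shell relies on.

```C
#define _POSIX_C_SOURCE 200809L /* ask for the POSIX declarations */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of bash): run `path` in a child process
 * and return its exit status, or -1 on failure.
 */
int run_program(const char *path, char *const argv[], char *const envp[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;          /* fork failed */

    if (pid == 0) {
        /* Child: on success execve() never returns. */
        execve(path, argv, envp);
        _exit(127);         /* reached only if execve() failed */
    }

    /* Parent: wait until the child terminates. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child replaces itself with the new program, so the code after `execve` runs only when the call fails. The parent gets the exit status back through `waitpid`.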
execve system call
--------------------------------------------------------------------------------
In the second [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) of this chapter we saw the preparation that takes place before a system call is called by a user application and what happens after a system call handler finishes its work. We stopped at the call of the `execve` system call in the previous paragraph. This system call is defined in the [fs/exec.c](https://github.com/torvalds/linux/blob/master/fs/exec.c) source code file and, as we already know, it takes three arguments:
```
SYSCALL_DEFINE3(execve,
        const char __user *, filename,
        const char __user *const __user *, argv,
        const char __user *const __user *, envp)
{
    return do_execve(getname(filename), argv, envp);
}
```
The implementation of `execve` is pretty simple here: as we can see, it just returns the result of the `do_execve` function. The `do_execve` function is defined in the same source code file and does the following things:
* initializes two pointers to userspace data with the given arguments and environment variables;
* returns the result of the `do_execveat_common` function.
We can see its implementation:
```C
struct user_arg_ptr argv = { .ptr.native = __argv };
struct user_arg_ptr envp = { .ptr.native = __envp };
return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
```
The `do_execveat_common` function does the main work: it executes a new program. This function takes a similar set of arguments, but as you can see it takes five arguments instead of three. The first argument is the file descriptor that represents the directory with our application; in our case `AT_FDCWD` means that the given pathname is interpreted relative to the current working directory of the calling process. The fifth argument is flags. In our case we passed `0` to `do_execveat_common`. We will see where the flags are checked later.
First of all, the `do_execveat_common` function checks the `filename` pointer and returns if it holds an error code instead of a valid pointer. After this it checks the flags of the current process to make sure that the limit of running processes is not exceeded:
```C
if (IS_ERR(filename))
    return PTR_ERR(filename);

if ((current->flags & PF_NPROC_EXCEEDED) &&
    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
    retval = -EAGAIN;
    goto out_ret;
}

current->flags &= ~PF_NPROC_EXCEEDED;
```
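A rough user-space analogue of the limit check can be written with `getrlimit`. This is only a sketch: `nproc_exceeded` is a hypothetical name, and the kernel compares its own per-user process counter, which is not available to us here, so the number of processes is passed in as a parameter.

```C
#include <sys/resource.h>

/*
 * Return 1 if `nprocs` processes would exceed the RLIMIT_NPROC soft limit
 * of the caller, 0 if not, and -1 if the limit cannot be read.
 */
int nproc_exceeded(unsigned long nprocs)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NPROC, &rl) != 0)
        return -1;

    if (rl.rlim_cur == RLIM_INFINITY)
        return 0;           /* no limit is configured */

    return nprocs > rl.rlim_cur;
}
```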
If these two checks were successful, we unset the `PF_NPROC_EXCEEDED` flag in the flags of the current process to prevent a failure of the `execve`. You can see that in the next step we call the `unshare_files` function, which is defined in [kernel/fork.c](https://github.com/torvalds/linux/blob/master/kernel/fork.c) and unshares the files of the current task, and check the result of this function:
```C
retval = unshare_files(&displaced);
if (retval)
    goto out_ret;
```
We need to call this function to eliminate a potential leak of the execve'd binary's [file descriptor](https://en.wikipedia.org/wiki/File_descriptor). In the next step we start the preparation of the `bprm`, which is represented by the `struct linux_binprm` structure (defined in the [include/linux/binfmts.h](https://github.com/torvalds/linux/blob/master/include/linux/binfmts.h) header file). The `linux_binprm` structure is used to hold the arguments that are used when loading binaries. For example, it contains the `vma` field, which has the `vm_area_struct` type and represents a single memory area over a contiguous interval in the address space where our application will be loaded, the `mm` field, which is the memory descriptor of the binary, a pointer to the top of memory and many other fields.
First of all we allocate memory for this structure with the `kzalloc` function and check the result of the allocation:
```C
bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
if (!bprm)
    goto out_files;
```
After this we start to prepare the `binprm` credentials with the call of the `prepare_bprm_creds` function:
```C
retval = prepare_bprm_creds(bprm);
if (retval)
    goto out_free;

check_unsafe_exec(bprm);
current->in_execve = 1;
```
Initialization of the `binprm` credentials is, in other words, initialization of the `cred` structure that is stored inside the `linux_binprm` structure. The `cred` structure contains the security context of a task, for example the [real uid](https://en.wikipedia.org/wiki/User_identifier#Real_user_ID) of the task, the real [gid](https://en.wikipedia.org/wiki/Group_identifier) of the task, the `uid` and `gid` for [virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system) operations, etc. In the next step, after the preparation of the `bprm` credentials, we check that we can safely execute a program with the call of the `check_unsafe_exec` function and set the current process to the `in_execve` state.
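Some of the identifiers kept in the security context of a task are visible from user space too. The following sketch collects them with the standard POSIX calls. The `toy_cred` structure and the `toy_cred_read` function are hypothetical names for this illustration and are much smaller than the kernel's `cred` structure.

```C
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical user-space view of a few identifiers stored in `cred`. */
struct toy_cred {
    uid_t uid, euid;    /* real and effective user ID */
    gid_t gid, egid;    /* real and effective group ID */
};

/* Fill the structure with the identifiers of the calling process. */
void toy_cred_read(struct toy_cred *c)
{
    c->uid  = getuid();
    c->euid = geteuid();
    c->gid  = getgid();
    c->egid = getegid();
}
```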
After all of these operations we call the `do_open_execat` function. It checks the flags that we passed to the `do_execveat_common` function (remember that we have `0` in the `flags`), searches for and opens the executable file on disk, checks that we are not loading a binary from a `noexec` mount point (we need to avoid executing a binary from filesystems that do not contain executable binaries, like [proc](https://en.wikipedia.org/wiki/Procfs) or [sysfs](https://en.wikipedia.org/wiki/Sysfs)), initializes the `file` structure and returns a pointer to it. After this we can see the call of the `sched_exec` function:
```C
file = do_open_execat(fd, filename, flags);
retval = PTR_ERR(file);
if (IS_ERR(file))
    goto out_unmark;

sched_exec();
```
The `sched_exec` function is used to determine the least loaded processor that can execute the new program and to migrate the current process to it.
After this we need to check the [file descriptor](https://en.wikipedia.org/wiki/File_descriptor) of the given executable binary. We check whether the name of our binary file starts with the `/` symbol or whether the path of the given executable binary is interpreted relative to the current working directory of the calling process, in other words whether the file descriptor is `AT_FDCWD` (read above about this).
If one of these checks is successful, we set the binary parameter filename:
```C
bprm->file = file;

if (fd == AT_FDCWD || filename->name[0] == '/') {
    bprm->filename = filename->name;
}
```
Otherwise we set the binary parameter filename to `/dev/fd/%d` if the filename is empty (which means that we will execute the file to which the file descriptor refers) or to `/dev/fd/%d/%s` if it is not (the filename is then relative to the directory to which the file descriptor refers):
```C
} else {
    if (filename->name[0] == '\0')
        pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d", fd);
    else
        pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d/%s",
                            fd, filename->name);
    if (!pathbuf) {
        retval = -ENOMEM;
        goto out_unmark;
    }

    bprm->filename = pathbuf;
}

bprm->interp = bprm->filename;
```
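The choice of the filename in the two fragments above can be replayed in user space. This is a minimal sketch and not kernel code: `exec_display_path` is a hypothetical name, and it writes into a caller-provided buffer with `snprintf` instead of allocating memory with `kasprintf`.

```C
#define _POSIX_C_SOURCE 200809L /* needed for AT_FDCWD */
#include <fcntl.h>              /* AT_FDCWD */
#include <stdio.h>              /* snprintf */

/*
 * Build the path that would be stored in bprm->filename.
 * Returns the length of the path, or -1 if it does not fit into `buf`.
 */
int exec_display_path(int fd, const char *name, char *buf, size_t len)
{
    int n;

    if (fd == AT_FDCWD || name[0] == '/')
        n = snprintf(buf, len, "%s", name);         /* used as is */
    else if (name[0] == '\0')
        n = snprintf(buf, len, "/dev/fd/%d", fd);   /* the fd itself */
    else
        n = snprintf(buf, len, "/dev/fd/%d/%s", fd, name);

    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

For example, the file descriptor `3` with an empty name gives `/dev/fd/3`, while the same descriptor with the name `bin/ls` gives `/dev/fd/3/bin/ls`.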
Note that we set not only `bprm->filename` but also `bprm->interp`, which will contain the name of the program interpreter. For now we just write the same name there, but later it will be updated with the real name of the program interpreter, depending on the binary format of the program. You can read above that we have already prepared the `cred` for the `linux_binprm`. The next step is the initialization of the other fields of the `linux_binprm`. First of all we call the `bprm_mm_init` function and pass the `bprm` to it:
```C
retval = bprm_mm_init(bprm);
if (retval)
    goto out_unmark;
```
The `bprm_mm_init` function is defined in the same source code file and, as we can understand from the function's name, it initializes the memory descriptor, or in other words the `mm_struct` structure. This structure is defined in the [include/linux/mm_types.h](https://github.com/torvalds/linux/blob/master/include/linux/mm_types.h) header file and represents the address space of a process. We will not consider the implementation of the `bprm_mm_init` function because we do not yet know many important things related to the Linux kernel memory manager, but we just need to know that this function initializes the `mm_struct` and populates it with a temporary stack `vm_area_struct`.
After this we calculate the number of command line arguments that were passed to our executable binary and the number of environment variables, and store them in `bprm->argc` and `bprm->envc` respectively:
```C
bprm->argc = count(argv, MAX_ARG_STRINGS);
if ((retval = bprm->argc) < 0)
    goto out;

bprm->envc = count(envp, MAX_ARG_STRINGS);
if ((retval = bprm->envc) < 0)
    goto out;
```
As you can see, we do these operations with the help of the `count` function, which is defined in the [same](https://github.com/torvalds/linux/blob/master/fs/exec.c) source code file and calculates the number of strings in the given array. The `MAX_ARG_STRINGS` macro is defined in the [include/uapi/linux/binfmts.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/binfmts.h) header file and, as we can understand from the macro's name, it represents the maximum number of strings that can be passed to the `execve` system call. The value of `MAX_ARG_STRINGS` is:
```C
#define MAX_ARG_STRINGS 0x7FFFFFFF
```
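The idea behind `count` is easy to show in user space. The sketch below is a simplified analogue and not the kernel implementation: `count_strings` is a hypothetical name, and it reads an ordinary `NULL`-terminated array instead of fetching pointers from userspace memory.

```C
#include <errno.h>
#include <stddef.h>

/*
 * Walk a NULL-terminated array of strings and return the number of
 * entries, or -E2BIG if the array has more than `max` entries.
 */
int count_strings(char *const argv[], int max)
{
    int i = 0;

    if (argv == NULL)
        return 0;

    while (argv[i] != NULL) {
        if (i >= max)
            return -E2BIG;
        i++;
    }

    return i;
}
```

With the array `{"ls", "-l", "/tmp", NULL}` the function returns `3` for any `max` of at least `3` and `-E2BIG` for a smaller `max`.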
After we have calculated the number of command line arguments and environment variables, we call the `prepare_binprm` function. We already called a function with a similar name before: it was `prepare_bprm_creds`, and we remember that it initializes the `cred` structure in the `linux_binprm`. Now the `prepare_binprm` function:
```C
retval = prepare_binprm(bprm);
if (retval < 0)
    goto out;
```
fills the `linux_binprm` structure with the `uid` from the [inode](https://en.wikipedia.org/wiki/Inode) and reads `128` bytes from the binary executable file. We read only the first `128` bytes of the executable file because we need to check the type of our executable. We will read the rest of the executable file in a later step. After the preparation of the `linux_binprm` structure we copy the filename of the executable binary file, the command line arguments and the environment variables to the `linux_binprm` with the calls of the `copy_strings_kernel` and `copy_strings` functions:
```C
retval = copy_strings_kernel(1, &bprm->filename, bprm);
if (retval < 0)
    goto out;

retval = copy_strings(bprm->envc, envp, bprm);
if (retval < 0)
    goto out;

retval = copy_strings(bprm->argc, argv, bprm);
if (retval < 0)
    goto out;
```
And we set the pointer to the top of the new program's stack, which was initialized in the `bprm_mm_init` function:
```C
bprm->exec = bprm->p;
```
The top of the stack will contain the program filename, and we store its position in the `exec` field of the `linux_binprm` structure.
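The way the strings are placed on the new stack can be modelled with a small buffer that is filled from its end. This is a toy model under simplifying assumptions: the `toy_stack` structure and the `toy_*` functions are hypothetical, the offset `p` plays the role of `bprm->p`, and the real kernel works with pages of the new address space instead of a flat array.

```C
#include <errno.h>
#include <string.h>

#define TOY_STACK_SIZE 256

/* Toy model of the argument area, `p` plays the role of bprm->p. */
struct toy_stack {
    char   mem[TOY_STACK_SIZE];
    size_t p;               /* offset of the current top, moves down */
};

void toy_stack_init(struct toy_stack *s)
{
    memset(s->mem, 0, sizeof(s->mem));
    s->p = TOY_STACK_SIZE;
}

/* Copy one string with its terminating NUL below the current top. */
int toy_push_string(struct toy_stack *s, const char *str)
{
    size_t len = strlen(str) + 1;

    if (len > s->p)
        return -E2BIG;      /* no room left */

    s->p -= len;
    memcpy(s->mem + s->p, str, len);
    return 0;
}

/*
 * Copy `argc` strings starting from the last one, so that they end up
 * in their natural order in memory.
 */
int toy_push_strings(struct toy_stack *s, int argc, const char *const argv[])
{
    while (argc-- > 0) {
        int ret = toy_push_string(s, argv[argc]);

        if (ret < 0)
            return ret;
    }
    return 0;
}
```

If we push the filename first, then the environment variables and then the arguments, the filename stays at the highest address and the first argument ends up at the final value of `p`.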
Now that we have filled the `linux_binprm` structure, we call the `exec_binprm` function:
```C
retval = exec_binprm(bprm);
if (retval < 0)
    goto out;
```
First of all, the `exec_binprm` function stores the [pid](https://en.wikipedia.org/wiki/Process_identifier) of the current task and its `pid` as seen from the [namespace](https://en.wikipedia.org/wiki/Cgroups) of its parent:
```C
old_pid = current->pid;
rcu_read_lock();
old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent));
rcu_read_unlock();
```
and calls the `search_binary_handler` function:
```C
search_binary_handler(bprm);
```
This function goes through the list of handlers for the different binary formats. Currently the Linux kernel supports the following binary formats:
* `binfmt_script` - support for interpreted scripts that start with the [#!](https://en.wikipedia.org/wiki/Shebang_%28Unix%29) line;
* `binfmt_misc` - support for different binary formats, according to the runtime configuration of the Linux kernel;
* `binfmt_elf` - support for the [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) format;
* `binfmt_aout` - support for the [a.out](https://en.wikipedia.org/wiki/A.out) format;
* `binfmt_flat` - support for the [flat](https://en.wikipedia.org/wiki/Binary_file#Structure) format;
* `binfmt_elf_fdpic` - support for [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) [FDPIC](http://elinux.org/UClinux_Shared_Library#FDPIC_ELF) binaries;
* `binfmt_em86` - support for Intel [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) binaries running on [Alpha](https://en.wikipedia.org/wiki/DEC_Alpha) machines.
So, `search_binary_handler` tries to call the `load_binary` function of each handler and passes the `linux_binprm` to it. If the binary handler supports the given executable file format, it starts to prepare the executable binary for execution:
```C
int search_binary_handler(struct linux_binprm *bprm)
{
    ...
    ...
    ...
    list_for_each_entry(fmt, &formats, lh) {
        retval = fmt->load_binary(bprm);
        if (retval < 0 && !bprm->mm) {
            force_sigsegv(SIGSEGV, current);
            return retval;
        }
    }

    return retval;
```
For example, the `load_binary` for [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) checks the magic number (each `elf` binary file contains a magic number in its header) in the `linux_binprm` buffer (remember that we read the first `128` bytes from the executable binary file) and exits if it is not an `elf` binary:
```C
static int load_elf_binary(struct linux_binprm *bprm)
{
    ...
    ...
    ...
    loc->elf_ex = *((struct elfhdr *)bprm->buf);

    if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
        goto out;
```
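The walk over the handlers together with the magic number check can be sketched in user space as follows. This is a heavily simplified model and not the kernel code: the `toy_binfmt` structure and all function names are hypothetical, only two formats are registered, and a handler just reports whether it recognizes the first bytes of a file. The `ELFMAG` and `SELFMAG` macros come from the user-space `<elf.h>` header.

```C
#include <elf.h>        /* ELFMAG, SELFMAG */
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced stand-in for a binary format handler. */
struct toy_binfmt {
    const char *name;
    int (*load_binary)(const unsigned char *buf);
};

/* Scripts start with the "#!" line. */
static int toy_load_script(const unsigned char *buf)
{
    return (buf[0] == '#' && buf[1] == '!') ? 0 : -ENOEXEC;
}

/* ELF files start with the magic number "\177ELF". */
static int toy_load_elf(const unsigned char *buf)
{
    return memcmp(buf, ELFMAG, SELFMAG) == 0 ? 0 : -ENOEXEC;
}

static const struct toy_binfmt toy_formats[] = {
    { "script", toy_load_script },
    { "elf",    toy_load_elf    },
};

/*
 * `buf` must hold at least the first SELFMAG bytes of the file.
 * Returns the name of the first handler that accepts it, or NULL.
 */
const char *toy_search_binary_handler(const unsigned char *buf)
{
    size_t i;

    for (i = 0; i < sizeof(toy_formats) / sizeof(toy_formats[0]); i++) {
        if (toy_formats[i].load_binary(buf) == 0)
            return toy_formats[i].name;
    }

    return NULL;
}
```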
If the given executable file is in the `elf` format, `load_elf_binary` continues to execute. It does many different things to prepare the executable file for execution. For example, it checks the architecture and type of the executable file:
```C
if (loc->elf_ex.e_type != ET_EXEC && loc->elf_ex.e_type != ET_DYN)
    goto out;
if (!elf_check_arch(&loc->elf_ex))
    goto out;
```
and exits if the architecture is wrong or the file is neither an executable nor a shared object. Then it tries to load the `program header table`:
```C
elf_phdata = load_elf_phdrs(&loc->elf_ex, bprm->file);
if (!elf_phdata)
    goto out;
```
that describes [segments](https://en.wikipedia.org/wiki/Memory_segmentation). It reads the `program interpreter` from disk and loads it into memory (the libraries that are linked with our executable binary file are loaded later by this interpreter). The `program interpreter` is specified in the `.interp` section of the executable file and, as you can read in the part that describes [Linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html), it is `/lib64/ld-linux-x86-64.so.2` for `x86_64`. It sets up the stack and maps the `elf` binary into the correct location in memory. It maps the [bss](https://en.wikipedia.org/wiki/.bss) and the [brk](http://man7.org/linux/man-pages/man2/sbrk.2.html) sections and does many other things to prepare the executable file for execution.
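The checks of the header and the lookup of the program interpreter can be tried on an ELF image that is already in memory. The sketch below makes several assumptions: `elf_find_interp` is a hypothetical helper, it accepts only 64-bit images for `x86_64` (the kernel uses the architecture specific `elf_check_arch` instead), and it relies on the structures and constants of the user-space `<elf.h>` header.

```C
#include <elf.h>
#include <stddef.h>
#include <string.h>

/*
 * Validate an in-memory ELF64 image and return a pointer to the
 * interpreter path stored in its PT_INTERP segment, or NULL if the
 * image is not acceptable or has no interpreter.
 */
const char *elf_find_interp(const unsigned char *image, size_t size)
{
    Elf64_Ehdr eh;
    Elf64_Phdr ph;
    size_t i;

    if (size < sizeof(eh))
        return NULL;
    memcpy(&eh, image, sizeof(eh));

    /* The same kind of checks that we saw above. */
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0)
        return NULL;
    if (eh.e_ident[EI_CLASS] != ELFCLASS64)
        return NULL;
    if (eh.e_type != ET_EXEC && eh.e_type != ET_DYN)
        return NULL;
    if (eh.e_machine != EM_X86_64)
        return NULL;
    if (eh.e_phentsize != sizeof(ph) || eh.e_phoff > size)
        return NULL;

    /* Walk the program header table and look for PT_INTERP. */
    for (i = 0; i < eh.e_phnum; i++) {
        size_t off = eh.e_phoff + i * sizeof(ph);

        if (off > size || size - off < sizeof(ph))
            return NULL;
        memcpy(&ph, image + off, sizeof(ph));

        if (ph.p_type != PT_INTERP)
            continue;

        /* The path must lie inside the image and end with NUL. */
        if (ph.p_filesz == 0 || ph.p_offset > size ||
            size - ph.p_offset < ph.p_filesz)
            return NULL;
        if (image[ph.p_offset + ph.p_filesz - 1] != '\0')
            return NULL;

        return (const char *)(image + ph.p_offset);
    }

    return NULL;
}
```

For a dynamically linked `x86_64` binary that was read or mapped into memory, this function is expected to return the path of the dynamic linker mentioned above.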
At the end of the execution of `load_elf_binary` we call the `start_thread` function and pass three arguments to it:
```C
    start_thread(regs, elf_entry, bprm->p);
    retval = 0;
out:
    kfree(loc);
out_ret:
    return retval;
```
These arguments are:
* Set of [registers](https://en.wikipedia.org/wiki/Processor_register) for the new task;
* Address of the entry point of the new task;
* Address of the top of the stack for the new task.
Judging by the function's name, we might think that it starts a new thread, but it does not. The `start_thread` function just prepares the new task's registers so that it is ready to run. Let's look at the implementation of this function:
```C
void
start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
{
    start_thread_common(regs, new_ip, new_sp,
                        __USER_CS, __USER_DS, 0);
}
```
As we can see, the `start_thread` function just calls the `start_thread_common` function, which does all the work for us:
```C
static void
start_thread_common(struct pt_regs *regs, unsigned long new_ip,
                    unsigned long new_sp,
                    unsigned int _cs, unsigned int _ss, unsigned int _ds)
{
    loadsegment(fs, 0);
    loadsegment(es, _ds);
    loadsegment(ds, _ds);
    load_gs_index(0);
    regs->ip = new_ip;
    regs->sp = new_sp;
    regs->cs = _cs;
    regs->ss = _ss;
    regs->flags = X86_EFLAGS_IF;
    force_iret();
}
```
The `start_thread_common` function fills the `fs` segment register with zero and loads `es` and `ds` with the value passed in `_ds` (which is `0` in our case, while `ss` gets `__USER_DS`). After this we set new values for the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter), the stack pointer, the `cs` and `ss` segment registers and the flags. At the end of the `start_thread_common` function we can see the `force_iret` macro, which forces the system call to return via the `iret` instruction. Ok, we have prepared the new thread to run in userspace, so now we can return from `exec_binprm` and we are in `do_execveat_common` again. After `exec_binprm` finishes its execution, we release the memory for the structures that were allocated before and return.
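To see which values end up in the saved registers, we can model this step in user space. This is only an illustration: the `toy_regs` structure is a reduced, hypothetical stand-in for `pt_regs`, and the three constants are the values that `__USER_CS`, `__USER_DS` and `X86_EFLAGS_IF` are assumed to have on `x86_64`.

```C
#include <stdint.h>

/* Assumed x86_64 values, check them against your kernel sources. */
#define TOY_USER_CS    0x33u    /* __USER_CS */
#define TOY_USER_DS    0x2bu    /* __USER_DS */
#define TOY_EFLAGS_IF  0x200u   /* X86_EFLAGS_IF, interrupts enabled */

/* Reduced stand-in for struct pt_regs with the fields that are set. */
struct toy_regs {
    uint64_t ip, sp, cs, ss, flags;
};

/* Prepare the saved registers, nothing is started here. */
void toy_start_thread(struct toy_regs *regs, uint64_t new_ip, uint64_t new_sp)
{
    regs->ip    = new_ip;           /* entry point of the new program */
    regs->sp    = new_sp;           /* top of its stack */
    regs->cs    = TOY_USER_CS;
    regs->ss    = TOY_USER_DS;
    regs->flags = TOY_EFLAGS_IF;
}
```

The function only writes values into a structure. The new program starts to run later, when the processor loads these saved registers on the return to userspace.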
After we return from the `execve` system call handler, the execution of our program starts. This is possible because all context related information is already configured for this purpose. As we saw, the `execve` system call does not return control to the calling process on success; instead the code, data and other segments of the caller process are overwritten with the segments of the new program. The exit from our application will be implemented through the `exit` system call.
That's all. From this point our program will be executed.
Conclusion
--------------------------------------------------------------------------------
This is the end of the fourth and last part about the system call concept in the Linux kernel. We saw almost everything related to the `system call` concept in these four parts. We started from the understanding of the `system call` concept: we learned what it is and why user applications need it. Next we saw how Linux handles a system call from a user application. We met two concepts similar to the `system call`, the `vsyscall` and the `vDSO`, and finally we saw how the Linux kernel runs a user program.
If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-internals/issues/new).
**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
Links
--------------------------------------------------------------------------------
* [System call](https://en.wikipedia.org/wiki/System_call)
* [shell](https://en.wikipedia.org/wiki/Unix_shell)
* [bash](https://en.wikipedia.org/wiki/Bash_%28Unix_shell%29)
* [entry point](https://en.wikipedia.org/wiki/Entry_point)
* [C](https://en.wikipedia.org/wiki/C_%28programming_language%29)
* [environment variables](https://en.wikipedia.org/wiki/Environment_variable)
* [file descriptor](https://en.wikipedia.org/wiki/File_descriptor)
* [real uid](https://en.wikipedia.org/wiki/User_identifier#Real_user_ID)
* [virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system)
* [procfs](https://en.wikipedia.org/wiki/Procfs)
* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
* [inode](https://en.wikipedia.org/wiki/Inode)
* [pid](https://en.wikipedia.org/wiki/Process_identifier)
* [namespace](https://en.wikipedia.org/wiki/Cgroups)
* [#!](https://en.wikipedia.org/wiki/Shebang_%28Unix%29)
* [elf](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format)
* [a.out](https://en.wikipedia.org/wiki/A.out)
* [flat](https://en.wikipedia.org/wiki/Binary_file#Structure)
* [Alpha](https://en.wikipedia.org/wiki/DEC_Alpha)
* [FDPIC](http://elinux.org/UClinux_Shared_Library#FDPIC_ELF)
* [segments](https://en.wikipedia.org/wiki/Memory_segmentation)
* [Linkers](http://0xax.gitbooks.io/linux-insides/content/Misc/linkers.html)
* [Processor register](https://en.wikipedia.org/wiki/Processor_register)
* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter)
* [Previous part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-3.html)


@ -1,32 +1,32 @@
Executable and Linkable Format Executable and Linkable Format
================================================================================ ================================================================================
ELF (Executable and Linkable Format) is a standard file format for executable files, object code, shared libraries, and core dumps. Linux, as well as, many other UNIX-like operating systems uses this format. Let's look on the structure of ELF-64 File Format and some defintions in the linux kernel source code related with it. ELF (Executable and Linkable Format) is a standard file format for executable files, object code, shared libraries and core dumps. Linux and many UNIX-like operating systems use this format. Let's look at the structure of the ELF-64 Object File Format and some definitions in the linux kernel source code which related with it.
An ELF file consists of the following parts: An ELF object file consists of the following parts:
* ELF header - describes the main characteristics of the object file: type, CPU architecture, virtual address of the entry point, size and offset of the remaining parts, etc...; * ELF header - describes the main characteristics of the object file: type, CPU architecture, the virtual address of the entry point, the size and offset of the remaining parts, etc...;
* Program header table - lists the available segments and their attributes. Program header table needs loaders for placing sections of this file as virtual memory segments; * Program header table - lists the available segments and their attributes. Program header table need loaders for placing sections of the file as virtual memory segments;
* Section header table - contains the description of sections. * Section header table - contains the description of the sections.
Now let's look closer on these components. Now let's have a closer look on these components.
**ELF header** **ELF header**
It's located in the beginning of the object file. Its main point is to locate all other parts of the object file. ELF header contains following fields: The ELF header is located at the beginning of the object file. Its main purpose is to locate all other parts of the object file. The File header contains the following fields:
* ELF identification - array of bytes which helps identify this file as an ELF file and also provides information about general object file characteristics; * ELF identification - array of bytes which helps identify the file as an ELF object file and also provides information about general object file characteristic;
* Object file type - identifies the object file type. This field can describe whether this file is a relocatable file or executable file, etc...; * Object file type - identifies the object file type. This field can describe that ELF file is a relocatable object file, an executable file, etc...;
* Target architecture; * Target architecture;
* Version of the object file format; * Version of the object file format;
* Virtual address of the program entry point; * Virtual address of the program entry point;
* File offset of the program header table; * File offset of the program header table;
* File offset of the section header table; * File offset of the section header table;
* Size of the ELF header; * Size of an ELF header;
* Size of the program header table entry; * Size of a program header table entry;
* and other fields... * and other fields...
You can find `elf64_hdr` structure which presents ELF64 header in the linux kernel source code: You can find the `elf64_hdr` structure which presents ELF64 header in the linux kernel source code:
```C ```C
typedef struct elf64_hdr { typedef struct elf64_hdr {
@ -47,11 +47,11 @@ typedef struct elf64_hdr {
} Elf64_Ehdr; } Elf64_Ehdr;
``` ```
This structure defines in the [elf.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/elf.h) This structure defined in the [elf.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/elf.h#L220)
 **Sections**
-All data is stored in sections in an Elf file. Sections are identified by index in the section header table. Section header contains following fields:
+All data stores in a sections in an Elf object file. Sections identified by index in the section header table. Section header contains following fields:
 * Section name;
 * Section type;
@@ -64,7 +64,7 @@ All data is stored in sections in an Elf file. Sections are identified by index
 * Address alignment boundary;
 * Size of entries, if section has table;
-And presented with the following `elf64_shdr` structure in the linux kernel source code:
+And presented with the following `elf64_shdr` structure in the linux kernel:
 ```C
 typedef struct elf64_shdr {
@@ -81,9 +81,11 @@ typedef struct elf64_shdr {
 } Elf64_Shdr;
 ```
+[elf.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/elf.h#L312)
 **Program header table**
-All sections are grouped into segments in an executable file or shared library. Program header table is an array of structures which describe every segment. It looks like:
+All sections are grouped into segments in an executable or shared object file. Program header is an array of structures which describe every segment. It looks like:
 ```C
 typedef struct elf64_phdr {
@@ -98,14 +100,16 @@ typedef struct elf64_phdr {
 } Elf64_Phdr;
 ```
-`elf64_phdr` structure defines in the same [elf.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/elf.h).
-And ELF file also contains other fields/structures which you can find in the [Documentation](http://www.uclibc.org/docs/elf-64-gen.pdf). Now let's look on the `vmlinux`.
+in the linux kernel source code.
+`elf64_phdr` defined in the same [elf.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/elf.h#L254).
+The ELF object file also contains other fields/structures which you can find in the [Documentation](http://www.uclibc.org/docs/elf-64-gen.pdf). Now let's a look at the `vmlinux` ELF object.
 vmlinux
 --------------------------------------------------------------------------------
-`vmlinux` is an ELF file too. So we can look at it with the `readelf` util. First of all, let's look on the elf header of vmlinux:
+`vmlinux` is also a relocatable ELF object file . We can take a look at it with the `readelf` util. First of all let's look at the header:
 ```
 $ readelf -h vmlinux
@@ -131,15 +135,15 @@ ELF Header:
 Section header string table index: 70
 ```
-Here we can see that `vmlinux` is 64-bit executable file.
+Here we can see that `vmlinux` is a 64-bit executable file.
-We can read from the [Documentation/x86/x86_64/mm.txt](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt):
+We can read from the [Documentation/x86/x86_64/mm.txt](https://github.com/torvalds/linux/blob/master/Documentation/x86/x86_64/mm.txt#L19):
 ```
 ffffffff80000000 - ffffffffa0000000 (=512 MB) kernel text mapping, from phys 0
 ```
-So we can find it in the `vmlinux` with:
+We can then look this address up in the `vmlinux` ELF object with:
 ```
 $ readelf -s vmlinux | grep ffffffff81000000
@@ -148,9 +152,9 @@ $ readelf -s vmlinux | grep ffffffff81000000
 90766: ffffffff81000000 0 NOTYPE GLOBAL DEFAULT 1 startup_64
 ```
-Note that ,the address of `startup_64` routine is not `ffffffff80000000`, but `ffffffff81000000`. Now I'll explain why.
+Note that the address of the `startup_64` routine is not `ffffffff80000000`, but `ffffffff81000000` and now I'll explain why.
-We can see the following definition in the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S):
+We can see following definition in the [arch/x86/kernel/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S):
 ```
 . = __START_KERNEL;
@@ -172,13 +176,12 @@ Where `__START_KERNEL` is:
 #define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
 ```
-`__START_KERNEL_map` is the value from documentation - `ffffffff80000000` and `__PHYSICAL_START` is `0x1000000`. That's why address of the `startup_64` is `ffffffff81000000`.
+`__START_KERNEL_map` is the value from the documentation - `ffffffff80000000` and `__PHYSICAL_START` is `0x1000000`. That's why address of the `startup_64` is `ffffffff81000000`.
-At last we can get program headers from `vmlinux` with the following command:
+And at last we can get program headers from `vmlinux` with the following command:
 ```
-readelf -l vmlinux
+$ readelf -l vmlinux
 Elf file type is EXEC (Executable file)
 Entry point 0x1000000
@@ -210,6 +213,6 @@ Program Headers:
 .smp_locks .data_nosave .bss .brk
 ```
-Here we can see five segments with sections list. All of these sections you can find in the generated linker script at - `arch/x86/kernel/vmlinux.lds`.
-That's all. Of course it's not a full description of ELF(Executable and Linkable Format), but if you are interested in it, you can find documentation - [here](http://www.uclibc.org/docs/elf-64-gen.pdf)
+Here we can see five segments with sections list. You can find all of these sections in the generated linker script at - `arch/x86/kernel/vmlinux.lds`.
+That's all. Of course it's not a full description of ELF (Executable and Linkable Format), but if you want to know more, you can find the documentation - [here](http://www.uclibc.org/docs/elf-64-gen.pdf)