1
0
mirror of https://github.com/0xAX/linux-insides.git synced 2024-11-14 03:39:19 +00:00
linux-insides/SysCall/syscall-1.md

416 lines
25 KiB
Markdown
Raw Normal View History

2015-08-23 14:16:27 +00:00
System call in the Linux kernel. Part 1.
================================================================================
Introduction
--------------------------------------------------------------------------------
This post opens new chapter in [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book and as you may understand from the title, this chapter will devoted to the [System call](https://en.wikipedia.org/wiki/System_call) concept in the Linux kernel. The choice of the topic for this chapter is not accidental. In the previous [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we saw interrupts and interrupt handling. Concept of system calls is very similar to interrupts, because the most common way to implement system calls as software interrupts. We will see many different aspects that are related to the system call concept. For example, we will learn what's happening when a system call occurs from userspace, we will see implementation of a couple system call handlers in the Linux kernel, [VDSO](https://en.wikipedia.org/wiki/VDSO) and [vsyscall](https://lwn.net/Articles/446528/) concepts and many many more.
Before we will start to dive into the implementation of the system calls related stuff in the Linux kernel source code, it is good to know some theory about system calls. Let's do it in the following paragraph.
System calls. What is it?
--------------------------------------------------------------------------------
A system call is just an userspace request of a kernel service. Yes, the operating system kernel provides many services. When your program wants to write/read to/from a file, start to listen for connections on a [socket](https://en.wikipedia.org/wiki/Network_socket), delete or create directory, or even to finish its work, a program uses a system call. In another words, a system call is just a [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) function that is placed in the kernel space and an user program can ask kernel to do something via this function.
The Linux kernel provides a set of this function and each architecture provides its own set. For example: the [x86_64](https://en.wikipedia.org/wiki/X86-64) provides [322](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl) system calls and the [x86](https://en.wikipedia.org/wiki/X86) provides [358](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_32.tbl) different system calls. Ok, a system call is just a function. Let's look on a simple `Hello wrold` example that written in assembly programming language:
```assembly
.data
msg:
.ascii "Hello, world!\n"
len = . - msg
.text
.global _start
_start:
movq $1, %rax
movq $1, %rdi
movq $msg, %rsi
movq $len, %rdx
syscall
movq $60, %rax
xorq %rdi, %rdi
syscall
```
We can compile with the following commands:
```
$ gcc -c test.S
$ ld -o test test.o
```
and run it with the:
```
./test
Hello, world!
```
Ok, what do we see here? This simple code represents `Hello world` assembly program for the Linux `x86_64` architecture. We can see two sections here:
* `.data`
* `.text`
The first section - `.data` stores initialized data of our program (`Hello world` string and its length in our case). The second section - `.text` contains code of our program. We can split the code of our program into two parts: first part will be before first `syscall` instruction and the second part will be between first and second `syscall` instructions. First of all what does the `syscall` instruction in our code and generally? As we can read in the [64-ia-32-architectures-software-developer-vol-2b-manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html):
```
SYSCALL invokes an OS system-call handler at privilege level 0. It does so by
loading RIP from the IA32_LSTAR MSR (after saving the address of the instruction
following SYSCALL into RCX). (The WRMSR instruction ensures that the
IA32_LSTAR MSR always contain a canonical address.)
...
...
...
SYSCALL loads the CS and SS selectors with values derived from bits 47:32 of the
IA32_STAR MSR. However, the CS and SS descriptor caches are not loaded from the
descriptors (in GDT or LDT) referenced by those selectors.
Instead, the descriptor caches are loaded with fixed values. It is the respon-
sibility of OS software to ensure that the descriptors (in GDT or LDT) referenced
by those selector values correspond to the fixed values loaded into the descriptor
caches; the SYSCALL instruction does not ensure this correspondence.
```
and we are initilizing `syscalls` by the writing of the `entry_SYSCALL_64` that defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembler file and represents `SYSCALL` instruction entry to the `IA32_STAR` [Model specific register](https://en.wikipedia.org/wiki/Model-specific_register):
```C
wrmsrl(MSR_LSTAR, entry_SYSCALL_64);
```
in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c) source code file.
So, the `syscall` instruction invokes a handler of a given system call. But how it knows, what's handler to call? Actually it gets this information from the general purpose [registers](https://en.wikipedia.org/wiki/Processor_register). As you can see in the system call [table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl), each system call has an unique number. In our example, first system call is - `write` that writes data to the given file. Let's look in the system call table and try to find `write` system call. As we can see, the [write](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L10) system call has number - `1`. We pass number of this system call through `rax` register in our example. The next general purpose registers: `%rdi`, `%rsi` and `%rdx` takes parameters of the `write` syscall. In our case, they are [file descriptor](https://en.wikipedia.org/wiki/File_descriptor) (`1` is [stdout](https://en.wikipedia.org/wiki/Standard_streams#Standard_output_.28stdout.29) in our case), second parameter is the pointer to our string, and the third is size of data. Yes, you heard right. Parameters for a system call. As I already wrote above, a system call is a just `C` function in the kernel space. In our case first system call is write. This system call defined in the [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) source code file and looks like:
```C
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
size_t, count)
{
...
...
...
}
```
Or in another words:
```C
ssize_t write(int fd, const void *buf, size_t nbytes);
```
Don't worry about the `SYSCALL_DEFINE3` macro for now, we'll come back to it.
The second part of our example is the same, but we call other system call. In this case we call [exit](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L69) system call. This system call gets only one parameter:
* Return value
and handles exit of our program. We can pass program name of our program to the [strace](https://en.wikipedia.org/wiki/Strace) util and we will see our system calls:
```
$ strace test
execve("./test", ["./test"], [/* 62 vars */]) = 0
write(1, "Hello, world!\n", 14Hello, world!
) = 14
_exit(0) = ?
+++ exited with 0 +++
```
In the first file of the `strace` output, we can see [execve](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L68) system call that executes our program, and the second and third are system calls that we have used in our program: `write` and `exit`. Note that we pass parameter through the general purpose registers in our example. The order of the registers is not not accidental. Order of the registers defined by the following agreement - [x86-64 calling conventions](https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions). This and other agreement for the `x86_64` architecture explained in the special document - [System V Application Binary Interface. PDF](http://www.x86-64.org/documentation/abi.pdf). In a general way, argument(s) of a function are placed either in registers or pushed on the stack. The right order is:
* `rdi`;
* `rsi`;
* `rdx`;
* `rcx`;
* `r8`;
* `r9`.
for the first six parameters of a function. If a function has more than six arguments, other parameters will be placed on the stack.
We do not use system calls in our code directly, but anyway our program uses it when we want to print something, check access to a file or just write or read something to it.
For example:
```C
#include <stdio.h>
int main(int argc, char **argv)
{
FILE *fp;
char buff[255];
fp = fopen("test.txt", "r");
fgets(buff, 255, fp);
printf("%s\n", buff);
fclose(fp);
return 0;
}
```
There are no `fopen`, `fgets`, `printf` and `fclose` system calls in the Linux kernel, but `open`, `read` `write` and `close` instead. I think you know that these four functions `fopen`, `fgets`, `printf` and `fclose` are just functions that defined in the `C` [standard library](https://en.wikipedia.org/wiki/GNU_C_Library). Actually these functions are wrappers for the system calls. We do not call system calls directly in our code, but using [wrapper](https://en.wikipedia.org/wiki/Wrapper_function) functions from the standard library. The main reason of this is simple: system call must be performed quickly, very quickly. As a system call must be quickly, it must be small. The standard library takes responsibility to call system call with the correct set parameters and makes different check before it will call the given system call. Let's compile our program with the following command:
```
$ gcc test.c -o test
```
and look on it with the [ltrace](https://en.wikipedia.org/wiki/Ltrace) util:
```
$ ltrace ./test
__libc_start_main([ "./test" ] <unfinished ...>
fopen("test.txt", "r") = 0x602010
fgets("Hello World!\n", 255, 0x602010) = 0x7ffd2745e700
puts("Hello World!\n"Hello World!
) = 14
fclose(0x602010) = 0
+++ exited (status 0) +++
```
The `ltrace` util displays a set of userspace calls of a program. The `fopen` function opens the given text file, the `fgets` reads file content to the `buf` buffer, the `puts` function prints it to the `stdout` and the `fclose` function closes file by the given file descriptor. And as I already wrote, all of these functions call an appropriate system call. For example `puts` calls the `write` system call inside, we can see it if we will add `-S` option to the `ltrace` program:
```
write@SYS(1, "Hello World!\n\n", 14) = 14
```
Yes, system calls are ubiquitous. Each program needs to open/write/read file, network connection, allocation of memory and many other things that can be provide only by the kernel. The [proc](https://en.wikipedia.org/wiki/Procfs) file system contains special file in a format: `/proc/pid/systemcall` that exposes the system call number and argument registers for the system call currently being executed by the process. For example, first pid that is [systemd](https://en.wikipedia.org/wiki/Systemd) for me uses:
```
$ sudo cat /proc/1/comm
systemd
$ sudo cat /proc/1/syscall
232 0x4 0x7ffdf82e11b0 0x1f 0xffffffff 0x100 0x7ffdf82e11bf 0x7ffdf82e11a0 0x7f9114681193
```
the system call with number - `232` which is [epoll_wait](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L241) system call that waits for an I/O event on an [epoll](https://en.wikipedia.org/wiki/Epoll) file descriptor. Or for example `emacs` editor where I'm writing this part:
```
$ ps ax | grep emacs
2093 ? Sl 2:40 emacs
$ sudo cat /proc/2093/comm
emacs
$ sudo cat /proc/2093/syscall
270 0xf 0x7fff068a5a90 0x7fff068a5b10 0x0 0x7fff068a59c0 0x7fff068a59d0 0x7fff068a59b0 0x7f777dd8813c
```
the system call with the number `270` which is [sys_pselect6](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L279) system call that allows `emacs` to monitor multiple file descriptors.
Now we know a little about system call, what is it and why do we need in it. So let's look on the `write` system that our program used.
Implementation of write system call
--------------------------------------------------------------------------------
Let's look on the implementation of this system call directly in the source code of the Linux kernel. As we already know, the `write` system call defined in the [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) source code file and looks like this:
```C
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
size_t, count)
{
struct fd f = fdget_pos(fd);
ssize_t ret = -EBADF;
if (f.file) {
loff_t pos = file_pos_read(f.file);
ret = vfs_write(f.file, buf, count, &pos);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
}
return ret;
}
```
First of all about the `SYSCALL_DEFINE3` macro. This macro defined in the [include/linux/syscalls.h](https://github.com/torvalds/linux/blob/master/include/linux/syscalls.h) header file and expands to the definition of the `sys_name(...)` function. Let's look on this macro:
```C
#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
#define SYSCALL_DEFINEx(x, sname, ...) \
SYSCALL_METADATA(sname, x, __VA_ARGS__) \
__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
```
As we can see the `SYSCALL_DEFINE3` macro takes `name` parameter which will represent name of a system call and variadic number of parameters. This macro just expands to the `SYSCALL_DEFINEx` macro that takes the number of the parameters the given system call, the `_##name` stub for the future name of the system call (more about tokens concatenation with the `##` you can read in the [documentation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html) of [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)). Next we can see the `SYSCALL_DEFINEx` macro. This macro expands to the two following macros:
* `SYSCALL_METADATA`;
* `__SYSCALL_DEFINEx`.
Implementation of the first macro `SYSCALL_METADATA` depends on the `CONFIG_FTRACE_SYSCALLS` kernel configuration option. As we can understand from the name of this option, it allows to enable tracer to catch the syscall entry and exit events. If this kernel configration option is enabled, the `SYSCALL_METADATA` macro executes initialization of the `syscall_metadata` structure that defined in the [include/trace/syscall.h](https://github.com/torvalds/linux/blob/master/include/trace/syscall.h) header file and contains different useful fields as name of a system call, number of a system call in the system call [table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl), number of parameters of a system call, list of parameter types and etc:
```C
#define SYSCALL_METADATA(sname, nb, ...) \
... \
... \
... \
struct syscall_metadata __used \
__syscall_meta_##sname = { \
.name = "sys"#sname, \
.syscall_nr = -1, \
.nb_args = nb, \
.types = nb ? types_##sname : NULL, \
.args = nb ? args_##sname : NULL, \
.enter_event = &event_enter_##sname, \
.exit_event = &event_exit_##sname, \
.enter_fields = LIST_HEAD_INIT(__syscall_meta_##sname.enter_fields), \
}; \
static struct syscall_metadata __used \
__attribute__((section("__syscalls_metadata"))) \
*__p_syscall_meta_##sname = &__syscall_meta_##sname;
```
If the `CONFIG_FTRACE_SYSCALLS` kernel option does not enabled during kernel configuration, in this way the `SYSCALL_METADATA` macro expands to empty string:
```C
#define SYSCALL_METADATA(sname, nb, ...)
```
The second macro `__SYSCALL_DEFINEx` expands to the definition of the five following functions:
```C
#define __SYSCALL_DEFINEx(x, name, ...) \
asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
__attribute__((alias(__stringify(SyS##name)))); \
\
static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__)); \
\
asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)); \
\
asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
{ \
long ret = SYSC##name(__MAP(x,__SC_CAST,__VA_ARGS__)); \
__MAP(x,__SC_TEST,__VA_ARGS__); \
__PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__)); \
return ret; \
} \
\
static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__))
```
The first `sys##name` is definition of the syscall handler function with the given name - `sys_system_call_name`. The `__SC_DECL` macro takes the `__VA_ARGS__` and combines call input parameter system type and the parameter name, because the macro definition is unable to determine the parameter types. And the `__MAP` macro applyes `__SC_DECL` macro to the `__VA_ARGS__` arguments. The other functions that are generated by the `__SYSCALL_DEFINEx` macro are need to protect from the [CVE-2009-0029](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-0029) and we will not dive into details about this here. Ok, as result of the `SYSCALL_DEFINE3` macro, we will have:
```C
asmlinkage long sys_write(unsigned int fd, const char __user * filename, size_t count);
```
Now we know a little about system calls definition and we can back to the implementation of the `write` system call. Let's look on the implementation of this system call again:
```C
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
size_t, count)
{
struct fd f = fdget_pos(fd);
ssize_t ret = -EBADF;
if (f.file) {
loff_t pos = file_pos_read(f.file);
ret = vfs_write(f.file, buf, count, &pos);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
}
return ret;
}
```
As we already know and can see on the code, it takes three arguments:
* `fd` - file descriptor;
* `buf` - buffer to write;
* `count` - length of buffer to write.
and writes data from a buffer declared by the user to a given device or a file. Note that the second parameter `buf`, defined with the `__user` attribute. The main purpose of this attribute is for checking of the Linux kernel code with the [sparse](https://en.wikipedia.org/wiki/Sparse) util. It defined in the [include/linux/compiler.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler.h) header file and depends on the `__CHECKER__` definition in the Linux kernel. That's all about useful meta-information related to our `sys_write` system call, let's try to understand how this system call is implemented. As we can see it starts from the definition of the `f` structure that has `fd` structure type that represent file descriptor in the Linux kernel and we put the result of the call of the `fdget_pos` function. The `fdget_pos` function defined in the same [source](https://github.com/torvalds/linux/blob/master/fs/read_write.c) code file and just expands to the call of the `__to_fd` function:
```C
static inline struct fd fdget_pos(int fd)
{
return __to_fd(__fdget_pos(fd));
}
```
The main purpose of the `fdget_pos` is convert given file descriptor which is just number to the `fd` strucutre. Through the long chain of function calls, the `fdget_pos` function get the file descriptor table of the current process or in another words `current->files` and tries to find correspnding file descriptor number there. As we got `fd` structure for the given file descriptor number, we check it and return if it does not exist. In other way we get the current position in the file with the call of the `file_pos_read` function that just returns `f_pos` field of the our file:
```C
static inline loff_t file_pos_read(struct file *file)
{
return file->f_pos;
}
```
and call the `vfs_write` function. The `vfs_write` function defined the [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) source code file and does main work for us - writes given buffer to the given file starting from the given position. We will not dive into details about the `vfs_write` function, because this function is weakly related to the `system call` concept but mostly about [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system) concept which we will see in another chapter. As the `vfs_write` has finished its work, we check the result of it and if it was finished successfully we change the position in the file with the `file_pos_write` function:
```C
if (ret >= 0)
file_pos_write(f.file, pos);
```
that just updates `f_pos` with the given position of the give file:
```C
static inline void file_pos_write(struct file *file, loff_t pos)
{
file->f_pos = pos;
}
```
In the end of the our `write` system call handler, we can see call of the following function:
```C
fdput_pos(f);
```
unlocks the `f_pos_lock` mutex that protects file position during concurrently write from threads that share file descriptor.
That's all.
Just now, we partly saw implementation one of system calls that provided by the Linux kernel. Of course we have missed some parts in the implementation of the `write` system call in this part, because as I already wrote above, we will see only system calls related stuff in this chapter and will not see other stuff related to the other subsystem as [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system) and etc.
Conclusion
--------------------------------------------------------------------------------
This is the end of the first part about system calls concept in the Linux kernel. We saw theory about this concept in this part and in the next part we will continue to dive into this topic and start to touch Linux kernel code which is related to the system calls.
If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-internals/issues/new).
**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-insides).**
Links
--------------------------------------------------------------------------------
* [system call](https://en.wikipedia.org/wiki/System_call)
* [vdso](https://en.wikipedia.org/wiki/VDSO)
* [vsyscall](https://lwn.net/Articles/446528/)
* [general purpose registers](https://en.wikipedia.org/wiki/Processor_register)
* [socket](https://en.wikipedia.org/wiki/Network_socket)
* [C programming language](https://en.wikipedia.org/wiki/C_%28programming_language%29)
* [x86](https://en.wikipedia.org/wiki/X86)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [x86-64 calling conventions](https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions)
* [System V Application Binary Interface. PDF](http://www.x86-64.org/documentation/abi.pdf)
* [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
* [Intel manual. PDF](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html)
* [system call table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl)
* [GCC macro documentation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html)
* [file descriptor](https://en.wikipedia.org/wiki/File_descriptor)
* [stdout](https://en.wikipedia.org/wiki/Standard_streams#Standard_output_.28stdout.29)
* [strace](https://en.wikipedia.org/wiki/Strace)
* [standard library](https://en.wikipedia.org/wiki/GNU_C_Library)
* [wrapper functions](https://en.wikipedia.org/wiki/Wrapper_function)
* [ltrace](https://en.wikipedia.org/wiki/Ltrace)
* [sparse](https://en.wikipedia.org/wiki/Sparse)
* [proc file system](https://en.wikipedia.org/wiki/Procfs)
* [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system)
* [systemd](https://en.wikipedia.org/wiki/Systemd)
* [epoll](https://en.wikipedia.org/wiki/Epoll)
* [Previous chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html)