From 33fb1c940a7267cb0ff2e47591591ad0cdd20c89 Mon Sep 17 00:00:00 2001
From: Ian Miell
Date: Sun, 2 Aug 2015 20:56:18 +0100
Subject: [PATCH 01/15] Nit-picks and corrections up to "GNU Linker"

---
 Misc/linkers.md | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/Misc/linkers.md b/Misc/linkers.md
index 726f5f4..a4219ff 100644
--- a/Misc/linkers.md
+++ b/Misc/linkers.md
@@ -3,7 +3,7 @@ Introduction

 During the writing of the [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book I have received many emails with questions related to the [linker](https://en.wikipedia.org/wiki/Linker_%28computing%29) script and linker-related subjects. So I've decided to write this to cover some aspects of the linker and the linking of object files.

-If we open page the `Linker` page on wikipidia, we will see following definition:
+If we open the `Linker` page on wikipidia, we will see following definition:

 >In computer science, a linker or link editor is a computer program that takes one or more object files generated by a compiler and combines them into a single executable file, library file, or another object file.

@@ -12,7 +12,7 @@ If you've written at least one program on C in your life, you will have seen fil
 Linking process
 ---------------

-Let's create simple project with the following structure:
+Let's create a simple project with the following structure:

 ```
 *-linkers
@@ -21,7 +21,7 @@ Let's create simple project with the following structure:
 *--lib.h
 ```

-And write there our example factorial program. Our `main.c` source code file contains:
+Our `main.c` source code file contains:

 ```C
 #include
@@ -34,7 +34,7 @@ int main(int argc, char **argv) {
 }
 ```

-The `lib.c` contains:
+The `lib.c` file contains:

 ```C
 int factorial(int base) {
@@ -140,14 +140,14 @@ Relocation is the process of connecting symbolic references with symbolic defini
 19: 89 c6 mov %eax,%esi
 ```

-Note `e8 00 00 00 00` on the first line. The `e8` is the [opcode](https://en.wikipedia.org/wiki/Opcode) of the `call` instruction with a relative offset. So the `e8 00 00 00 00` contains a one-byte operation code followed by a four-byte address. Note that the `00 00 00 00` is 4-bytes, but why only 4-bytes if an address can be 8-bytes in the `x86_64`. Actually we compiled the `main.c` source code file with the `-mcmodel=small`. From the `gcc` man:
+Note the `e8 00 00 00 00` on the first line. The `e8` is the [opcode](https://en.wikipedia.org/wiki/Opcode) of the `call`, and the remainder of the line is a relative offset. So the `e8 00 00 00 00` contains a one-byte operation code followed by a four-byte address. Note that the `00 00 00 00` is 4-bytes. Why only 4-bytes if an address can be 8-bytes in a `x86_64` (64-bit) machine? Actually we compiled the `main.c` source code file with the `-mcmodel=small`! From the `gcc` man page:

 ```
 -mcmodel=small
 Generate code for the small code model: the program and its symbols must be linked in the lower 2 GB of the address space. Pointers are 64 bits. Programs can be statically or dynamically linked. This is the default code model.
 ```

-Of course we didn't pass this option to the `gcc` when we compiled the `main.c`, but it is default. We know that our program will be linked in the lower 2 GB of the address space from the quoute from `gcc` manual. In this way 4-bytes enough for this. So we have opcode of the `call` instruction and unknown address. When we compile `main.c` with all dependencies to the executable file and will look on the call of the factorial we will see:
+Of course we didn't pass this option to the `gcc` when we compiled the `main.c`, but it is the default. We know that our program will be linked in the lower 2 GB of the address space from the `gcc` manual extract above. Four bytes is therefore enough for this. So we have opcode of the `call` instruction and an unknown address. When we compile `main.c` with all its dependencies to an executable file, and then look at the factorial call we see:

 ```
 $ gcc main.c lib.c -o factorial | objdump -S factorial | grep factorial
@@ -168,7 +168,7 @@ factorial: file format elf64-x86-64
 ...
 ```

-As we can see in the previous output, the address of the `main` function is `0x0000000000400506`. Why it does not starts from the `0x0`? You already can know that standard C program is linked with the `glibc` C standard library if the `-nostdlib` was not passed to the `gcc`. The compiled code for a program includes constructors functions to initialize data in the program when the program is started. These functions need to be called before the program is started or in another words before the `main` function is called. To make the initialization and termination functions work, the compiler must output something in the assembler code to cause those functions to be called at the appropriate time. Execution of this program will starts from the code that placed in the special section which is called `.init`. We can see it in the beginning of the objdump output:
+As we can see in the previous output, the address of the `main` function is `0x0000000000400506`. Why it does not start from `0x0`? You may already know that standard C programs are linked with the `glibc` C standard library (assuming the `-nostdlib` was not passed to the `gcc`). The compiled code for a program includes constructor functions to initialize data in the program when the program is started. These functions need to be called before the program is started, or in another words before the `main` function is called. To make the initialization and termination functions work, the compiler must output something in the assembler code to cause those functions to be called at the appropriate time. Execution of this program will start from the code placed in the special `.init` section. We can see this in the beginning of the objdump output:

 ```
 objdump -S factorial | less
@@ -182,23 +182,25 @@ Disassembly of section .init:
 4003ac: 48 8b 05 a5 05 20 00 mov 0x2005a5(%rip),%rax # 600958 <_DYNAMIC+0x1d0>
 ```

-Not that it starts at the `0x00000000004003a8` address relative to the `glibc` code. We can check it also in the resulted [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format):
+Not that it starts at the `0x00000000004003a8` address relative to the `glibc` code. We can check it also in the [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) output by running `readelf`:

 ```
 $ readelf -d factorial | grep \(INIT\)
 0x000000000000000c (INIT) 0x4003a8
 ```

-So, the address of the `main` function is the `0000000000400506` and it is offset from the `.init` section. As we can see from the output, the address of the `factorial` function is `0x0000000000400537` and binary code for the call of the `factorial` function now is `e8 18 00 00 00`. We already knwo that `e8` is opcode for the `call` instruction, the next `18 00 00 00` (note that address represented as little endian for the `x86_64`, in other words it is `00 00 00 18`) is the offset from the `callq` to the `factorial` function:
+So, the address of the `main` function is `0000000000400506` and is offset from the `.init` section. As we can see from the output, the address of the `factorial` function is `0x0000000000400537` and binary code for the call of the `factorial` function now is `e8 18 00 00 00`. We already know that `e8` is opcode for the `call` instruction, the next `18 00 00 00` (note that address represented as little endian for `x86_64`, so it is `00 00 00 18`) is the offset from the `callq` to the `factorial` function:

 ```python
 >>> hex(0x40051a + 0x18 + 0x5) == hex(0x400537)
 True
 ```

-So we add `0x18` and `0x5` to the address of the `call` instruction. The offset is measured from the address of the following instruction. Our call instruction is 5-bytes size - `e8 18 00 00 00` and the `0x18` is the offset from the next after call instruction to the `factorial` function. A compiler generally creates each object file with the program addresses starting at zero. But if a program is created from multiple object files, all of they will be overlapped. Just now we saw a process which called - `relocation`. This process assigns load addresses to the various parts of the program, adjusting the code and data in the program to reflect the assigned addresses.
+So we add `0x18` and `0x5` to the address of the `call` instruction. The offset is measured from the address of the following instruction. Our call instruction is 5-bytes long (`e8 18 00 00 00`) and the `0x18` is the offset of the call after the `factorial` function. A compiler generally creates each object file with the program addresses starting at zero. But if a program is created from multiple object files, these will overlap.

-Ok, now we know a little about linkers and relocation. Time to link our object files and to know more about linkers.
+What we have seen in this section is the `relocation` process. This process assigns load addresses to the various parts of the program, adjusting the code and data in the program to reflect the assigned addresses.
+
+Ok, now that we know a little about linkers and relocation it is time to learn more about linkers by linking our object files.

 GNU linker
 -----------------

From e8b9f62ee2cebd5be7e28755db0bf6fc6676f5d2 Mon Sep 17 00:00:00 2001
From: Nikola Kotur
Date: Tue, 18 Aug 2015 22:14:26 +0200
Subject: [PATCH 02/15] Update radix-tree.md

---
 DataStructures/radix-tree.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/DataStructures/radix-tree.md b/DataStructures/radix-tree.md
index 90c0a69..8c09017 100644
--- a/DataStructures/radix-tree.md
+++ b/DataStructures/radix-tree.md
@@ -4,7 +4,7 @@ Data Structures in the Linux Kernel
 Radix tree
 --------------------------------------------------------------------------------

-As you already know linux kernel provides many different libraries and functions which implement different data structures and algorithm. In this part we will consider one of these data structures - [Radix tree](http://en.wikipedia.org/wiki/Radix_tree). There are two files which are related to `radix tree` implementation and API in the linux kernel:
+As you already know linux kernel provides many different libraries and functions which implement different data structures and algorithms. In this part we will consider one of these data structures - [Radix tree](http://en.wikipedia.org/wiki/Radix_tree). There are two files which are related to `radix tree` implementation and API in the linux kernel:

 * [include/linux/radix-tree.h](https://github.com/torvalds/linux/blob/master/include/linux/radix-tree.h)
 * [lib/radix-tree.c](https://github.com/torvalds/linux/blob/master/lib/radix-tree.c)

From 52c5b527b44bda0d9806fefe45f2d5592b3dbe22 Mon Sep 17 00:00:00 2001
From: Michael Aquilina
Date: Sun, 23 Aug 2015 21:53:59 +0100
Subject: [PATCH 03/15] Numerous grammatical fixes

---
 SysCall/syscall-1.md | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/SysCall/syscall-1.md b/SysCall/syscall-1.md
index 0965ef2..a21b3b7 100644
--- a/SysCall/syscall-1.md
+++ b/SysCall/syscall-1.md
@@ -4,16 +4,16 @@ System calls in the Linux kernel. Part 1.
 Introduction
 --------------------------------------------------------------------------------

-This post opens new chapter in [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book and as you may understand from the title, this chapter will devoted to the [System call](https://en.wikipedia.org/wiki/System_call) concept in the Linux kernel. The choice of the topic for this chapter is not accidental. In the previous [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we saw interrupts and interrupt handling. Concept of system calls is very similar to interrupts, because the most common way to implement system calls as software interrupts. We will see many different aspects that are related to the system call concept. For example, we will learn what's happening when a system call occurs from userspace, we will see implementation of a couple system call handlers in the Linux kernel, [VDSO](https://en.wikipedia.org/wiki/VDSO) and [vsyscall](https://lwn.net/Articles/446528/) concepts and many many more.
+This post opens up a new chapter in [linux-insides](http://0xax.gitbooks.io/linux-insides/content/) book and as you may understand from the title, this chapter will be devoted to the [System call](https://en.wikipedia.org/wiki/System_call) concept in the Linux kernel. The choice of topic for this chapter is not accidental. In the previous [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we saw interrupts and interrupt handling. The concept of system calls is very similar to that of interrupts. This is because the most common way to implement system calls is as software interrupts. We will see many different aspects that are related to the system call concept. For example, we will learn what's happening when a system call occurs from userspace, we will see an implementation of a couple system call handlers in the Linux kernel, [VDSO](https://en.wikipedia.org/wiki/VDSO) and [vsyscall](https://lwn.net/Articles/446528/) concepts and many many more.

-Before we will start to dive into the implementation of the system calls related stuff in the Linux kernel source code, it is good to know some theory about system calls. Let's do it in the following paragraph.
+Before we start to dive into the implementation of the system calls related stuff in the Linux kernel source code, it is good to know some theory about system calls. Let's do it in the following paragraph.

 System call. What is it?
 --------------------------------------------------------------------------------

-A system call is just an userspace request of a kernel service. Yes, the operating system kernel provides many services. When your program wants to write to or read from a file, start to listen for connections on a [socket](https://en.wikipedia.org/wiki/Network_socket), delete or create directory, or even to finish its work, a program uses a system call. In another words, a system call is just a [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) function that is placed in the kernel space and an user program can ask kernel to do something via this function.
+A system call is just a userspace request of a kernel service. Yes, the operating system kernel provides many services. When your program wants to write to or read from a file, starts to listen for connections on a [socket](https://en.wikipedia.org/wiki/Network_socket), delete or create a directory, or even to finish its work, a program uses a system call. In another words, a system call is just a [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) function that is placed in the kernel space and a user program can ask the kernel to do something via this function.

-The Linux kernel provides a set of these functions and each architecture provides its own set. For example: the [x86_64](https://en.wikipedia.org/wiki/X86-64) provides [322](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl) system calls and the [x86](https://en.wikipedia.org/wiki/X86) provides [358](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_32.tbl) different system calls. Ok, a system call is just a function. Let's look on a simple `Hello world` example that written in assembly programming language:
+The Linux kernel provides a set of these functions and each architecture provides its own set. For example: the [x86_64](https://en.wikipedia.org/wiki/X86-64) provides [322](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl) system calls and the [x86](https://en.wikipedia.org/wiki/X86) provides [358](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_32.tbl) different system calls. Ok, a system call is just a function. Let's look on a simple `Hello world` example that's written in the assembly programming language:

 ```assembly
 .data
@@ -37,14 +37,14 @@ _start:
 syscall
 ```

-We can compile with the following commands:
+We can compile the above with the following commands:

 ```
 $ gcc -c test.S
 $ ld -o test test.o
 ```

-and run it with the:
+and run it as follows:

 ```
 ./test
@@ -56,7 +56,7 @@ Ok, what do we see here? This simple code represents `Hello world` assembly prog
 * `.data`
 * `.text`

-The first section - `.data` stores initialized data of our program (`Hello world` string and its length in our case). The second section - `.text` contains code of our program. We can split the code of our program into two parts: first part will be before first `syscall` instruction and the second part will be between first and second `syscall` instructions. First of all what does the `syscall` instruction in our code and generally? As we can read in the [64-ia-32-architectures-software-developer-vol-2b-manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html):
+The first section - `.data` stores initialized data of our program (`Hello world` string and its length in our case). The second section - `.text` contains the code of our program. We can split the code of our program into two parts: first part will be before the first `syscall` instruction and the second part will be between first and second `syscall` instructions. First of all what does the `syscall` instruction do in our code and generally? As we can read in the [64-ia-32-architectures-software-developer-vol-2b-manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html):

 ```
 SYSCALL invokes an OS system-call handler at privilege level 0. It does so by
@@ -76,7 +76,7 @@ by those selector values correspond to the fixed values loaded into the descript
 caches; the SYSCALL instruction does not ensure this correspondence.
 ```

-and we are initilizing `syscalls` by the writing of the `entry_SYSCALL_64` that defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembler file and represents `SYSCALL` instruction entry to the `IA32_STAR` [Model specific register](https://en.wikipedia.org/wiki/Model-specific_register):
+and we are initializing `syscalls` by the writing of the `entry_SYSCALL_64` that defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembler file and represents `SYSCALL` instruction entry to the `IA32_STAR` [Model specific register](https://en.wikipedia.org/wiki/Model-specific_register):

 ```C
 wrmsrl(MSR_LSTAR, entry_SYSCALL_64);
@@ -84,7 +84,7 @@ wrmsrl(MSR_LSTAR, entry_SYSCALL_64);

 in the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/common.c) source code file.

-So, the `syscall` instruction invokes a handler of a given system call. But how it knows, what's handler to call? Actually it gets this information from the general purpose [registers](https://en.wikipedia.org/wiki/Processor_register). As you can see in the system call [table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl), each system call has an unique number. In our example, first system call is - `write` that writes data to the given file. Let's look in the system call table and try to find `write` system call. As we can see, the [write](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L10) system call has number - `1`. We pass number of this system call through `rax` register in our example. The next general purpose registers: `%rdi`, `%rsi` and `%rdx` takes parameters of the `write` syscall. In our case, they are [file descriptor](https://en.wikipedia.org/wiki/File_descriptor) (`1` is [stdout](https://en.wikipedia.org/wiki/Standard_streams#Standard_output_.28stdout.29) in our case), second parameter is the pointer to our string, and the third is size of data. Yes, you heard right. Parameters for a system call. As I already wrote above, a system call is a just `C` function in the kernel space. In our case first system call is write. This system call defined in the [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) source code file and looks like:
+So, the `syscall` instruction invokes a handler of a given system call. But how does it know which handler to call? Actually it gets this information from the general purpose [registers](https://en.wikipedia.org/wiki/Processor_register). As you can see in the system call [table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl), each system call has an unique number. In our example, first system call is - `write` that writes data to the given file. Let's look in the system call table and try to find `write` system call. As we can see, the [write](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L10) system call has number - `1`. We pass the number of this system call through the `rax` register in our example. The next general purpose registers: `%rdi`, `%rsi` and `%rdx` take parameters of the `write` syscall. In our case, they are [file descriptor](https://en.wikipedia.org/wiki/File_descriptor) (`1` is [stdout](https://en.wikipedia.org/wiki/Standard_streams#Standard_output_.28stdout.29) in our case), second parameter is the pointer to our string, and the third is size of data. Yes, you heard right. Parameters for a system call. As I already wrote above, a system call is a just `C` function in the kernel space. In our case first system call is write. This system call defined in the [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) source code file and looks like:

 ```C
 SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
@@ -96,7 +96,7 @@ SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
 }
 ```

-Or in another words:
+Or in other words:

 ```C
 ssize_t write(int fd, const void *buf, size_t nbytes);
@@ -108,7 +108,7 @@ The second part of our example is the same, but we call other system call. In th

 * Return value

-and handles exit of our program. We can pass program name of our program to the [strace](https://en.wikipedia.org/wiki/Strace) util and we will see our system calls:
+and handles the way our program exits. We can pass the program name of our program to the [strace](https://en.wikipedia.org/wiki/Strace) util and we will see our system calls:

 ```
 $ strace test
@@ -120,7 +120,7 @@ _exit(0) = ?
 +++ exited with 0 +++
 ```

-In the first file of the `strace` output, we can see [execve](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L68) system call that executes our program, and the second and third are system calls that we have used in our program: `write` and `exit`. Note that we pass parameter through the general purpose registers in our example. The order of the registers is not not accidental. Order of the registers defined by the following agreement - [x86-64 calling conventions](https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions). This and other agreement for the `x86_64` architecture explained in the special document - [System V Application Binary Interface. PDF](http://www.x86-64.org/documentation/abi.pdf). In a general way, argument(s) of a function are placed either in registers or pushed on the stack. The right order is:
+In the first file of the `strace` output, we can see [execve](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L68) system call that executes our program, and the second and third are system calls that we have used in our program: `write` and `exit`. Note that we pass the parameter through the general purpose registers in our example. The order of the registers is not not accidental. The order of the registers is defined by the following agreement - [x86-64 calling conventions](https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions). This and other agreement for the `x86_64` architecture explained in the special document - [System V Application Binary Interface. PDF](http://www.x86-64.org/documentation/abi.pdf). In a general way, argument(s) of a function are placed either in registers or pushed on the stack. The right order is:

 * `rdi`;
 * `rsi`;
@@ -152,7 +152,7 @@ int main(int argc, char **argv)
 }
 ```

-There are no `fopen`, `fgets`, `printf` and `fclose` system calls in the Linux kernel, but `open`, `read` `write` and `close` instead. I think you know that these four functions `fopen`, `fgets`, `printf` and `fclose` are just functions that defined in the `C` [standard library](https://en.wikipedia.org/wiki/GNU_C_Library). Actually these functions are wrappers for the system calls. We do not call system calls directly in our code, but using [wrapper](https://en.wikipedia.org/wiki/Wrapper_function) functions from the standard library. The main reason of this is simple: system call must be performed quickly, very quickly. As a system call must be quick, it must be small. The standard library takes responsibility to call system call with the correct set parameters and makes different check before it will call the given system call. Let's compile our program with the following command:
+There are no `fopen`, `fgets`, `printf` and `fclose` system calls in the Linux kernel, but `open`, `read` `write` and `close` instead. I think you know that these four functions `fopen`, `fgets`, `printf` and `fclose` are just functions that defined in the `C` [standard library](https://en.wikipedia.org/wiki/GNU_C_Library). Actually these functions are wrappers for the system calls. We do not call system calls directly in our code, but using [wrapper](https://en.wikipedia.org/wiki/Wrapper_function) functions from the standard library. The main reason of this is simple: a system call must be performed quickly, very quickly. As a system call must be quick, it must be small. The standard library takes responsibility to perform system calls with the correct set parameters and makes different check before it will call the given system call. Let's compile our program with the following command:

 ```
 $ gcc test.c -o test

From b3eee2756b9598ae063fb171896f2598d9312b81 Mon Sep 17 00:00:00 2001
From: John-Nicholas Furst
Date: Mon, 24 Aug 2015 00:23:47 -0400
Subject: [PATCH 04/15] Fixing typo

---
 SysCall/syscall-1.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/SysCall/syscall-1.md b/SysCall/syscall-1.md
index 0965ef2..94cb447 100644
--- a/SysCall/syscall-1.md
+++ b/SysCall/syscall-1.md
@@ -120,7 +120,7 @@ _exit(0) = ?
 +++ exited with 0 +++
 ```

-In the first file of the `strace` output, we can see [execve](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L68) system call that executes our program, and the second and third are system calls that we have used in our program: `write` and `exit`. Note that we pass parameter through the general purpose registers in our example. The order of the registers is not not accidental. Order of the registers defined by the following agreement - [x86-64 calling conventions](https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions). This and other agreement for the `x86_64` architecture explained in the special document - [System V Application Binary Interface. PDF](http://www.x86-64.org/documentation/abi.pdf). In a general way, argument(s) of a function are placed either in registers or pushed on the stack. The right order is:
+In the first line of the `strace` output, we can see [execve](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L68) system call that executes our program, and the second and third are system calls that we have used in our program: `write` and `exit`. Note that we pass parameter through the general purpose registers in our example. The order of the registers is not not accidental. Order of the registers defined by the following agreement - [x86-64 calling conventions](https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions). This and other agreement for the `x86_64` architecture explained in the special document - [System V Application Binary Interface. PDF](http://www.x86-64.org/documentation/abi.pdf). In a general way, argument(s) of a function are placed either in registers or pushed on the stack. The right order is:

 * `rdi`;
 * `rsi`;

From 8caaa215046a721c21a0559a9ba64c80d38fbe75 Mon Sep 17 00:00:00 2001
From: Scott Bigelow
Date: Sun, 23 Aug 2015 20:28:06 -0700
Subject: [PATCH 05/15] Adding a few connecting words, removing a few others

---
 SysCall/syscall-1.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/SysCall/syscall-1.md b/SysCall/syscall-1.md
index 38d5318..53c1d5c 100644
--- a/SysCall/syscall-1.md
+++ b/SysCall/syscall-1.md
@@ -131,7 +131,7 @@ In the first line of the `strace` output, we can see [execve](https://github.com

 for the first six parameters of a function. If a function has more than six arguments, other parameters will be placed on the stack.

-We do not use system calls in our code directly, but anyway our program uses it when we want to print something, check access to a file or just write or read something to it.
+We do not use system calls in our code directly, but our program uses it when we want to print something, check access to a file or just write or read something to it.

 For example:

From a1afb76ac7bd1db55c6373e0785588c91d291c2a Mon Sep 17 00:00:00 2001
From: Scott Bigelow
Date: Sun, 23 Aug 2015 20:36:51 -0700
Subject: [PATCH 06/15] More fixes, trying to leave author's voice unchanged

---
 SysCall/syscall-1.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/SysCall/syscall-1.md b/SysCall/syscall-1.md
index 53c1d5c..b480afd 100644
--- a/SysCall/syscall-1.md
+++ b/SysCall/syscall-1.md
@@ -178,7 +178,7 @@ The `ltrace` util displays a set of userspace calls of a program. The `fopen` fu
 write@SYS(1, "Hello World!\n\n", 14) = 14
 ```

-Yes, system calls are ubiquitous. Each program needs to open/write/read file, network connection, allocation of memory and many other things that can be provide only by the kernel. The [proc](https://en.wikipedia.org/wiki/Procfs) file system contains special file in a format: `/proc/pid/systemcall` that exposes the system call number and argument registers for the system call currently being executed by the process. For example, first pid that is [systemd](https://en.wikipedia.org/wiki/Systemd) for me uses:
+Yes, system calls are ubiquitous. Each program needs to open/write/read file, network connection, allocate memory and many other things that can be provided only by the kernel. The [proc](https://en.wikipedia.org/wiki/Procfs) file system contains special file in a format: `/proc/pid/systemcall` that exposes the system call number and argument registers for the system call currently being executed by the process. For example, pid 1, that is [systemd](https://en.wikipedia.org/wiki/Systemd) for me:

 ```
 $ sudo cat /proc/1/comm
@@ -203,7 +203,7 @@ $ sudo cat /proc/2093/syscall

 the system call with the number `270` which is [sys_pselect6](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L279) system call that allows `emacs` to monitor multiple file descriptors.

-Now we know a little about system call, what is it and why do we need in it. So let's look on the `write` system that our program used.
+Now we know a little about system call, what is it and why we need in it. So let's look at the `write` system call that our program used.

 Implementation of write system call
 --------------------------------------------------------------------------------

From 17dc115e76687d6f0413b5a6c73a57ddef1a5924 Mon Sep 17 00:00:00 2001
From: Scott Bigelow
Date: Sun, 23 Aug 2015 20:53:14 -0700
Subject: [PATCH 07/15] Reworded a bit to make sentences flow better

---
 SysCall/syscall-1.md | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/SysCall/syscall-1.md b/SysCall/syscall-1.md
index b480afd..fb0c5d1 100644
--- a/SysCall/syscall-1.md
+++ b/SysCall/syscall-1.md
@@ -208,7 +208,7 @@ Now we know a little about system call, what is it and why we need in it. So let
 Implementation of write system call
 --------------------------------------------------------------------------------

-Let's look on the implementation of this system call directly in the source code of the Linux kernel. As we already know, the `write` system call defined in the [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) source code file and looks like this:
+Let's look at the implementation of this system call directly in the source code of the Linux kernel. As we already know, the `write` system call is defined in the [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) source code file and looks like this:

 ```C
 SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
@@ -229,7 +229,7 @@ SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
 }
 ```

-First of all about the `SYSCALL_DEFINE3` macro. This macro defined in the [include/linux/syscalls.h](https://github.com/torvalds/linux/blob/master/include/linux/syscalls.h) header file and expands to the definition of the `sys_name(...)` function. Let's look on this macro:
+First of all, the `SYSCALL_DEFINE3` macro is defined in the [include/linux/syscalls.h](https://github.com/torvalds/linux/blob/master/include/linux/syscalls.h) header file and expands to the definition of the `sys_name(...)` function. Let's look at this macro:

 ```C
 #define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
@@ -302,7 +302,7 @@ The first `sys##name` is definition of the syscall handler function with the giv
 asmlinkage long sys_write(unsigned int fd, const char __user * filename, size_t count);
 ```

-Now we know a little about system calls definition and we can back to the implementation of the `write` system call. Let's look on the implementation of this system call again:
+Now we know a little about the system call's definition and we can go back to the implementation of the `write` system call. Let's look on the implementation of this system call again:

 ```C
 SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
@@ -323,13 +323,13 @@ SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
 }
 ```

-As we already know and can see on the code, it takes three arguments:
+As we already know and can see from the code, it takes three arguments:

 * `fd` - file descriptor;
 * `buf` - buffer to write;
 * `count` - length of buffer to write.

-and writes data from a buffer declared by the user to a given device or a file. Note that the second parameter `buf`, defined with the `__user` attribute. The main purpose of this attribute is for checking of the Linux kernel code with the [sparse](https://en.wikipedia.org/wiki/Sparse) util. It defined in the [include/linux/compiler.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler.h) header file and depends on the `__CHECKER__` definition in the Linux kernel. That's all about useful meta-information related to our `sys_write` system call, let's try to understand how this system call is implemented. As we can see it starts from the definition of the `f` structure that has `fd` structure type that represent file descriptor in the Linux kernel and we put the result of the call of the `fdget_pos` function. The `fdget_pos` function defined in the same [source](https://github.com/torvalds/linux/blob/master/fs/read_write.c) code file and just expands to the call of the `__to_fd` function: +and writes data from a buffer declared by the user to a given device or a file. Note that the second parameter `buf`, defined with the `__user` attribute. The main purpose of this attribute is for checking the Linux kernel code with the [sparse](https://en.wikipedia.org/wiki/Sparse) util. It is defined in the [include/linux/compiler.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler.h) header file and depends on the `__CHECKER__` definition in the Linux kernel. That's all about useful meta-information related to our `sys_write` system call, let's try to understand how this system call is implemented. As we can see it starts from the definition of the `f` structure that has `fd` structure type that represent file descriptor in the Linux kernel and we put the result of the call of the `fdget_pos` function. 
The `fdget_pos` function defined in the same [source](https://github.com/torvalds/linux/blob/master/fs/read_write.c) code file and just expands the call of the `__to_fd` function: ```C static inline struct fd fdget_pos(int fd) @@ -338,7 +338,7 @@ static inline struct fd fdget_pos(int fd) } ``` -The main purpose of the `fdget_pos` is convert given file descriptor which is just number to the `fd` strucutre. Through the long chain of function calls, the `fdget_pos` function get the file descriptor table of the current process or in another words `current->files` and tries to find correspnding file descriptor number there. As we got `fd` structure for the given file descriptor number, we check it and return if it does not exist. In other way we get the current position in the file with the call of the `file_pos_read` function that just returns `f_pos` field of the our file: +The main purpose of the `fdget_pos` is to convert the given file descriptor which is just a number to the `fd` structure. Through the long chain of function calls, the `fdget_pos` function gets the file descriptor table of the current process, `current->files`, and tries to find a corresponding file descriptor number there. As we got the `fd` structure for the given file descriptor number, we check it and return if it does not exist. We get the current position in the file with the call of the `file_pos_read` function that just returns `f_pos` field of the our file: ```C static inline loff_t file_pos_read(struct file *file) @@ -347,14 +347,14 @@ static inline loff_t file_pos_read(struct file *file) } ``` -and call the `vfs_write` function. The `vfs_write` function defined the [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) source code file and does main work for us - writes given buffer to the given file starting from the given position. 
We will not dive into details about the `vfs_write` function, because this function is weakly related to the `system call` concept but mostly about [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system) concept which we will see in another chapter. As the `vfs_write` has finished its work, we check the result of it and if it was finished successfully we change the position in the file with the `file_pos_write` function: +and call the `vfs_write` function. The `vfs_write` function defined the [fs/read_write.c](https://github.com/torvalds/linux/blob/master/fs/read_write.c) source code file and does the work for us - writes given buffer to the given file starting from the given position. We will not dive into details about the `vfs_write` function, because this function is weakly related to the `system call` concept but mostly about [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system) concept which we will see in another chapter. After the `vfs_write` has finished its work, we check the result and if it was finished successfully we change the position in the file with the `file_pos_write` function: ```C if (ret >= 0) file_pos_write(f.file, pos); ``` -that just updates `f_pos` with the given position of the give file: +that just updates `f_pos` with the given position in the given file: ```C static inline void file_pos_write(struct file *file, loff_t pos) @@ -363,22 +363,22 @@ static inline void file_pos_write(struct file *file, loff_t pos) } ``` -In the end of the our `write` system call handler, we can see call of the following function: +At the end of the our `write` system call handler, we can see the call of the following function: ```C fdput_pos(f); ``` -unlocks the `f_pos_lock` mutex that protects file position during concurrently write from threads that share file descriptor. +unlocks the `f_pos_lock` mutex that protects file position during concurrent writes from threads that share file descriptor. That's all. 
-Just now, we partly saw implementation one of system calls that provided by the Linux kernel. Of course we have missed some parts in the implementation of the `write` system call in this part, because as I already wrote above, we will see only system calls related stuff in this chapter and will not see other stuff related to the other subsystem as [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system) and etc. +We have seen the partial implementation of one system call provided by the Linux kernel. Of course we have missed some parts in the implementation of the `write` system call, because as I mentioned above, we will see only system calls related stuff in this chapter and will not see other stuff related to other subsystems, such as [Virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system). Conclusion -------------------------------------------------------------------------------- -This is the end of the first part about system calls concept in the Linux kernel. We saw theory about this concept in this part and in the next part we will continue to dive into this topic and start to touch Linux kernel code which is related to the system calls. +This is the end of the first part about system calls concept in the Linux kernel. We have covered the theory of system calls so far and in the next part we will continue to dive into this topic, touching Linux kernel code which is related to system calls. If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-internals/issues/new). 
From 2eb25a53174f34148d1974a599d4bd7fdaf3eff6 Mon Sep 17 00:00:00 2001 From: Scott Bigelow Date: Sun, 23 Aug 2015 20:55:29 -0700 Subject: [PATCH 08/15] Re-reworded conclusion sentences --- SysCall/syscall-1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SysCall/syscall-1.md b/SysCall/syscall-1.md index fb0c5d1..943d878 100644 --- a/SysCall/syscall-1.md +++ b/SysCall/syscall-1.md @@ -378,7 +378,7 @@ We have seen the partial implementation of one system call provided by the Linux Conclusion -------------------------------------------------------------------------------- -This is the end of the first part about system calls concept in the Linux kernel. We have covered the theory of system calls so far and in the next part we will continue to dive into this topic, touching Linux kernel code which is related to system calls. +This concludes the first part covering system call concepts in the Linux kernel. We have covered the theory of system calls so far and in the next part we will continue to dive into this topic, touching Linux kernel code related to system calls. If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-internals/issues/new). From 511688b110a38836b811fdc8d7819f9ca24ce1c9 Mon Sep 17 00:00:00 2001 From: Dave Flogeras Date: Thu, 27 Aug 2015 21:49:42 -0300 Subject: [PATCH 09/15] Fix typo --- Booting/linux-bootstrap-4.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Booting/linux-bootstrap-4.md b/Booting/linux-bootstrap-4.md index 7c8d7bd..15039fc 100644 --- a/Booting/linux-bootstrap-4.md +++ b/Booting/linux-bootstrap-4.md @@ -216,7 +216,7 @@ no_longmode: jmp 1b ``` -We set stack, cheked CPU and now can move on the next step. +We set stack, checked CPU and now can move on the next step. 
Calculate relocation address -------------------------------------------------------------------------------- From 5dc2b2ee5c8be0b2ff9edf8adafbc548b5ce36cd Mon Sep 17 00:00:00 2001 From: 0xAX <0xAX@users.noreply.github.com> Date: Sat, 29 Aug 2015 23:05:52 +0600 Subject: [PATCH 10/15] Update SUMMARY.md --- SUMMARY.md | 1 + 1 file changed, 1 insertion(+) diff --git a/SUMMARY.md b/SUMMARY.md index e03aeef..e797e75 100644 --- a/SUMMARY.md +++ b/SUMMARY.md @@ -33,6 +33,7 @@ * [Fixmaps and ioremap](mm/linux-mm-2.md) * [System calls](SysCall/README.md) * [Introduction to system calls](SysCall/syscall-1.md) + * [How the Linux kernel handles a system call]() * [SMP]() * [Concepts](Concepts/README.md) * [Per-CPU variables](Concepts/per-cpu.md) From b770115189475e0d79345d0b0026ea83e3d43be7 Mon Sep 17 00:00:00 2001 From: 0xAX <0xAX@users.noreply.github.com> Date: Sun, 30 Aug 2015 20:00:58 +0600 Subject: [PATCH 11/15] Create syscall-2.md --- SysCall/syscall-2.md | 408 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 408 insertions(+) create mode 100644 SysCall/syscall-2.md diff --git a/SysCall/syscall-2.md b/SysCall/syscall-2.md new file mode 100644 index 0000000..a2c6ef1 --- /dev/null +++ b/SysCall/syscall-2.md @@ -0,0 +1,408 @@ +System calls in the Linux kernel. Part 2. +================================================================================ + +How the Linux kernel handles a system call +-------------------------------------------------------------------------------- + +The previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) was the first part of the chapter that describes the [system call](https://en.wikipedia.org/wiki/System_call) concept in the Linux kernel. We have learned what a `system call` is in the Linux kernel and in an operating system kernel in general, looked at this concept from user space and even saw a partial implementation of the [write](http://man7.org/linux/man-pages/man2/write.2.html) system call.
In this part we will continue to dive into this theme and, as we usually did in other chapters of this book, after some theory we will start to sink lower and lower, and go directly to the Linux kernel code. + +A user application does not call a system call directly. We did not write the `Hello world!` program like: + +```C +int main(int argc, char **argv) +{ + ... + ... + ... + sys_write(fd1, buf, strlen(buf)); + ... + ... +} +``` + +We can use something similar with the help of the [C standard library](https://en.wikipedia.org/wiki/GNU_C_Library) and it will look something like this: + +```C +#include <unistd.h> + +int main(int argc, char **argv) +{ + ... + ... + ... + write(fd1, buf, strlen(buf)); + ... + ... +} +``` + +But anyway, `write` is not a system call itself and not a kernel function. An application must fill the general purpose registers with the correct values in the fixed order and execute the `syscall` instruction to make the real system call. In this part we will learn what occurs in the Linux kernel when the processor meets the `syscall` instruction. + +Initialization of the system calls table +-------------------------------------------------------------------------------- + +From the previous part we know that the system call concept is very similar to an interrupt. Furthermore, system calls are implemented as software interrupts. So, when the processor handles the `syscall` instruction from a user application, this instruction causes an exception which transfers control to an exception handler. As we know, all exception handlers (or in other words kernel [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) functions that will react to an exception) are placed in the kernel code. But how does the Linux kernel find the address of the necessary system call handler for the related system call? The Linux kernel contains a special table, called the system call table.
The system call table is represented by the `sys_call_table` array in the Linux kernel, which is defined in the [arch/x86/entry/syscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscall_64.c) source code file. Let's look at its implementation: + +```C +asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = { + [0 ... __NR_syscall_max] = &sys_ni_syscall, + #include <asm/syscalls_64.h> +}; +``` + +As we can see, the `sys_call_table` is an array of `__NR_syscall_max + 1` size where the `__NR_syscall_max` macro represents the maximum number of system calls for the given [architecture](https://en.wikipedia.org/wiki/List_of_CPU_architectures). This book is about the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, so for our case the `__NR_syscall_max` is `322` and this is the correct number for now (when I'm writing this part, the Linux kernel version is `4.2.0-rc8+`). We can see this macro in the header file generated by [Kbuild](https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt) during kernel compilation - `include/generated/asm-offsets.h`: + +```C +#define __NR_syscall_max 322 +``` + +The same number of system calls is in the [arch/x86/entry/syscalls/syscall_64.tbl](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L331) for the `x86_64`. The two things to which we must turn our attention are the type of the `sys_call_table` array and the initialization of the elements of this array. First of all about the type. The `sys_call_ptr_t` represents a pointer to a system call handler stored in the system call table. It is defined as a [typedef](https://en.wikipedia.org/wiki/Typedef) for a function pointer that returns nothing and does not take arguments: + +```C +typedef void (*sys_call_ptr_t)(void); +``` + +The second thing is the initialization of the `sys_call_table` array. As we can see in the code above, all elements of our array that contains pointers to the system call handlers point to the `sys_ni_syscall`.
The `sys_ni_syscall` function represents not-implemented system calls. Yes, for now all elements of the `sys_call_table` array point to the not-implemented system call. This is correct behaviour, because at this point we only initialize the storage for the pointers to the system call handlers and will fill it later. The implementation of `sys_ni_syscall` is pretty easy, it just returns [-errno](http://man7.org/linux/man-pages/man3/errno.3.html) or `-ENOSYS` in our case: + +```C +asmlinkage long sys_ni_syscall(void) +{ + return -ENOSYS; +} +``` + +The `-ENOSYS` error tells us that: + +``` +ENOSYS Function not implemented (POSIX.1) +``` + +Also note the `...` in the initialization of the `sys_call_table`. We can do it with a [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) extension called [Designated Initializers](https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html). This extension allows us to initialize elements in non-fixed order. As you can see, we include the `asm/syscalls_64.h` header at the end of the array. This header file is generated by the special script located at [arch/x86/entry/syscalls/syscalltbl.sh](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscalltbl.sh), which generates our header file from the [syscall table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl). The `asm/syscalls_64.h` header contains invocations of the following macro: + +```C +__SYSCALL_COMMON(0, sys_read, sys_read) +__SYSCALL_COMMON(1, sys_write, sys_write) +__SYSCALL_COMMON(2, sys_open, sys_open) +__SYSCALL_COMMON(3, sys_close, sys_close) +__SYSCALL_COMMON(5, sys_newfstat, sys_newfstat) +... +... +...
+``` + +The `__SYSCALL_COMMON` macro is defined in the same source code [file](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscall_64.c) and expands to the `__SYSCALL_64` macro, which expands to an array element initializer: + +```C +#define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat) +#define __SYSCALL_64(nr, sym, compat) [nr] = sym, +``` + +So, after this, our `sys_call_table` takes the following form: + +```C +asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = { + [0 ... __NR_syscall_max] = &sys_ni_syscall, + [0] = sys_read, + [1] = sys_write, + [2] = sys_open, + ... + ... + ... +}; +``` + +After this, all elements that point to the non-implemented system calls will contain the address of the `sys_ni_syscall` function that just returns `-ENOSYS` as we saw above, and the other elements will point to the `sys_syscall_name` functions. + +At this point we have filled the system call table and the Linux kernel knows where each system call handler is. But the Linux kernel does not call a `sys_syscall_name` function right after it gets control to handle a system call from a user space application. Remember the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about interrupts and interrupt handling. When the Linux kernel gets control to handle an interrupt, it has to do some preparations like saving user space registers, switching to a new stack and many more before it calls an interrupt handler. The situation is the same with system call handling. The preparation before handling a system call is one thing, but before the Linux kernel starts to do these preparations, the entry point of a system call must be initialized, and only then does the Linux kernel know where to handle these preparations. In the next paragraph we will see the process of the initialization of the system call entry in the Linux kernel.
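
The table-filling technique described in this section can be tried in user space. The following is only a minimal sketch, not kernel code: `demo_call_table`, `DEMO_CALL`, `NR_DEMO_MAX`, `demo_dispatch` and the handlers are hypothetical names invented for this illustration, and the `[0 ... N]` range designator assumes a compiler that supports this GNU extension (GCC or Clang).

```C
#include <errno.h>

/* Hypothetical maximum call number for this sketch. */
#define NR_DEMO_MAX 7

typedef long (*demo_call_ptr_t)(void);

/* Default handler, playing the role of sys_ni_syscall. */
static long demo_ni_call(void) { return -ENOSYS; }

/* Two hypothetical "implemented" handlers. */
static long demo_read(void)  { return 100; }
static long demo_write(void) { return 200; }

/* Same shape as __SYSCALL_64: one designated initializer per entry. */
#define DEMO_CALL(nr, sym) [nr] = sym,

static const demo_call_ptr_t demo_call_table[NR_DEMO_MAX + 1] = {
    /* GNU range designator: every slot starts as "not implemented". */
    [0 ... NR_DEMO_MAX] = demo_ni_call,
    /* Later designators override the default for implemented numbers. */
    DEMO_CALL(0, demo_read)
    DEMO_CALL(1, demo_write)
};

/* Look up a handler by number, rejecting out-of-range numbers first. */
long demo_dispatch(unsigned int nr)
{
    if (nr > NR_DEMO_MAX)
        return -ENOSYS;
    return demo_call_table[nr]();
}
```

In this sketch `demo_dispatch(1)` runs `demo_write`, while `demo_dispatch(5)` reaches the default handler and yields `-ENOSYS`, which mirrors how unfilled slots of `sys_call_table` keep pointing to `sys_ni_syscall`.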
+ +Initialization of the system call entry +-------------------------------------------------------------------------------- + +When a system call occurs in the system, where are the first bytes of code that start to handle it? As we can read in the Intel manual - [64-ia-32-architectures-software-developer-vol-2b-manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html): + +``` +SYSCALL invokes an OS system-call handler at privilege level 0. +It does so by loading RIP from the IA32_LSTAR MSR +``` + +This means that we need to put the system call entry into the `IA32_LSTAR` [model specific register](https://en.wikipedia.org/wiki/Model-specific_register). This operation takes place during the Linux kernel initialization process. If you have read the fourth [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-4.html) of the chapter that describes interrupts and interrupt handling in the Linux kernel, you know that the Linux kernel calls the `trap_init` function during the initialization process. This function is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and executes the initialization of the `non-early` exception handlers like divide error, [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) error, etc. Besides the initialization of the `non-early` exception handlers, this function calls the `cpu_init` function from the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/blob/arch/x86/kernel/cpu/common.c) source code file which, besides the initialization of `per-cpu` state, calls the `syscall_init` function from the same source code file. + +This function does all the work of initializing the system call entry point. Let's look at the implementation of this function.
It does not take parameters and first of all fills two model specific registers: + +```C +wrmsrl(MSR_STAR, ((u64)__USER32_CS)<<48 | ((u64)__KERNEL_CS)<<32); +wrmsrl(MSR_LSTAR, entry_SYSCALL_64); +``` + +The first model specific register - `MSR_STAR` - contains the user code segment in bits `63:48`. These bits will be loaded into the `CS` and `SS` segment registers by the `sysret` instruction, which provides the functionality to return from a system call to user code with the related privilege. `MSR_STAR` also contains the kernel code segment in bits `47:32`, which will be used as the base selector for the `CS` and `SS` segment registers when a user space application executes a system call. In the second line of code we fill the `MSR_LSTAR` register with the `entry_SYSCALL_64` symbol that represents the system call entry. The `entry_SYSCALL_64` is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and contains code related to the preparation before a system call handler is executed (I already wrote about these preparations, read above). We will not consider the `entry_SYSCALL_64` now, but will return to it later in this chapter. + +After we have set the entry point for system calls, we need to set the following model specific registers: + +* `MSR_CSTAR` - target `rip` for the compatibility mode callers; +* `MSR_IA32_SYSENTER_CS` - target `cs` for the `sysenter` instruction; +* `MSR_IA32_SYSENTER_ESP` - target `esp` for the `sysenter` instruction; +* `MSR_IA32_SYSENTER_EIP` - target `eip` for the `sysenter` instruction. + +The values of these model specific registers depend on the `CONFIG_IA32_EMULATION` kernel configuration option. If this kernel configuration option is enabled, it allows legacy 32-bit programs to run under a 64-bit kernel.
In the first case, if the `CONFIG_IA32_EMULATION` kernel configuration option is enabled, we fill these model specific registers with the entry point for system calls in compatibility mode: + +```C +wrmsrl(MSR_CSTAR, entry_SYSCALL_compat); +``` + +and with the kernel code segment, put zero to the stack pointer and write the address of the `entry_SYSENTER_compat` symbol to the [instruction pointer](https://en.wikipedia.org/wiki/Program_counter): + +```C +wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS); +wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL); +wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat); +``` + +Otherwise, if the `CONFIG_IA32_EMULATION` kernel configuration option is disabled, we write the `ignore_sysret` symbol to the `MSR_CSTAR`: + +```C +wrmsrl(MSR_CSTAR, ignore_sysret); +``` + +that is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and just returns the `-ENOSYS` error code: + +```assembly +ENTRY(ignore_sysret) + mov $-ENOSYS, %eax + sysret +END(ignore_sysret) +``` + +Now we need to fill the `MSR_IA32_SYSENTER_CS`, `MSR_IA32_SYSENTER_ESP`, `MSR_IA32_SYSENTER_EIP` model specific registers as we did in the previous code when the `CONFIG_IA32_EMULATION` kernel configuration option was enabled.
In this case (when the `CONFIG_IA32_EMULATION` configuration option is not set) we fill the `MSR_IA32_SYSENTER_ESP` and the `MSR_IA32_SYSENTER_EIP` with zero and put the invalid segment of the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) to the `MSR_IA32_SYSENTER_CS` model specific register: + +```C +wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG); +wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL); +wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL); +``` + +You can read more about the `Global Descriptor Table` in the second [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) of the chapter that describes the booting process of the Linux kernel. + +At the end of the `syscall_init` function, we just mask flags in the [flags register](https://en.wikipedia.org/wiki/FLAGS_register) by writing a set of flags to the `MSR_SYSCALL_MASK` model specific register: + +```C +wrmsrl(MSR_SYSCALL_MASK, + X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF| + X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT); +``` + +These flags will be cleared on system call entry. That's all, it is the end of the `syscall_init` function and it means that the entry for system calls is ready to work. Now we can see what will occur when a user application executes the `syscall` instruction. + +Preparation before system call handler will be called +-------------------------------------------------------------------------------- + +As I already wrote, before a system call or an interrupt handler is called by the Linux kernel, some preparations need to be done. The `idtentry` macro does these preparations before an exception handler is executed, the `interrupt` macro does these preparations before an interrupt handler is called and the `entry_SYSCALL_64` does these preparations before a system call handler is executed.
+ +The `entry_SYSCALL_64` is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and starts with the following macro: + +```assembly +SWAPGS_UNSAFE_STACK +``` + +This macro is defined in the [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h) header file and expands to the `swapgs` instruction: + +```C +#define SWAPGS_UNSAFE_STACK swapgs +``` + +which exchanges the current GS base register value with the value contained in the `MSR_KERNEL_GS_BASE` model specific register. In other words, `GS` now points to the kernel's per-cpu data. After this we put the old stack pointer into the `rsp_scratch` [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable and set the stack pointer to the top of the stack for the current processor: + +```assembly +movq %rsp, PER_CPU_VAR(rsp_scratch) +movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp +``` + +In the next step we push the stack segment and the old stack pointer to the stack: + +```assembly +pushq $__USER_DS +pushq PER_CPU_VAR(rsp_scratch) +``` + +After this we enable interrupts, because interrupts are off on entry, and save the general purpose [registers](https://en.wikipedia.org/wiki/Processor_register) (besides `bp`, `bx` and from `r12` to `r15`), flags, `-ENOSYS` for the non-implemented system call and the code segment register on the stack: + +```assembly +ENABLE_INTERRUPTS(CLBR_NONE) + +pushq %r11 +pushq $__USER_CS +pushq %rcx +pushq %rax +pushq %rdi +pushq %rsi +pushq %rdx +pushq %rcx +pushq $-ENOSYS +pushq %r8 +pushq %r9 +pushq %r10 +pushq %r11 +sub $(6*8), %rsp +``` + +When a system call occurs from the user's application, general purpose registers have the following state: + +* `rax` - contains system call number; +* `rcx` - contains return address to the user space; +* `r11` - contains register flags; +* `rdi` - contains first argument of a system call handler; +* `rsi` - contains second
argument of a system call handler; +* `rdx` - contains third argument of a system call handler; +* `r10` - contains fourth argument of a system call handler; +* `r8` - contains fifth argument of a system call handler; +* `r9` - contains sixth argument of a system call handler; + +Other general purpose registers (such as `rbp`, `rbx` and from `r12` to `r15`) are callee-preserved in the [C ABI](http://www.x86-64.org/documentation/abi.pdf). So we push the register flags on top of the stack, then the user code segment, the return address to user space, the system call number, the first three arguments, the error code for the non-implemented system call and the other arguments on the stack. + +In the next step we check the `_TIF_WORK_SYSCALL_ENTRY` in the current `thread_info`: + +```assembly +testl $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS) +jnz tracesys +``` + +The `_TIF_WORK_SYSCALL_ENTRY` macro is defined in the [arch/x86/include/asm/thread_info.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/thread_info.h) header file and provides a set of the thread information flags that are related to system call tracing: + +```C +#define _TIF_WORK_SYSCALL_ENTRY \ + (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT | \ + _TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT | \ + _TIF_NOHZ) +``` + +We will not consider debugging/tracing related stuff in this chapter, but will see it in a separate chapter that will be devoted to debugging and tracing techniques in the Linux kernel. As we did not jump to the `tracesys` label, the next label is the `entry_SYSCALL_64_fastpath`.
In the `entry_SYSCALL_64_fastpath` we check the `__SYSCALL_MASK` that is defined in the [arch/x86/include/asm/unistd.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/unistd.h) header file as + +```C +# ifdef CONFIG_X86_X32_ABI +# define __SYSCALL_MASK (~(__X32_SYSCALL_BIT)) +# else +# define __SYSCALL_MASK (~0) +# endif +``` + +where the `__X32_SYSCALL_BIT` is + +```C +#define __X32_SYSCALL_BIT 0x40000000 +``` + +As we can see the `__SYSCALL_MASK` depends on the `CONFIG_X86_X32_ABI` kernel configuration option and represents the mask for the 32-bit [ABI](https://en.wikipedia.org/wiki/Application_binary_interface) in the 64-bit kernel. + +So we check the value of the `__SYSCALL_MASK` and if the `CONFIG_X86_X32_ABI` is disabled we compare the value of the `rax` register with the maximum syscall number (`__NR_syscall_max`); otherwise, if the `CONFIG_X86_X32_ABI` is enabled, we mask the `__X32_SYSCALL_BIT` out of the `eax` register and do the same comparison: + +```assembly +#if __SYSCALL_MASK == ~0 + cmpq $__NR_syscall_max, %rax +#else + andl $__SYSCALL_MASK, %eax + cmpl $__NR_syscall_max, %eax +#endif +``` + +After this we check the result of the last comparison with the `ja` instruction that jumps if both the `CF` and `ZF` flags are zero: + +```assembly +ja 1f +``` + +and if we have a correct system call number, we move the fourth argument from the `r10` to the `rcx` to keep the [x86_64 C ABI](http://www.x86-64.org/documentation/abi.pdf) and execute the `call` instruction with the address of a system call handler: + +```assembly +movq %r10, %rcx +call *sys_call_table(, %rax, 8) +``` + +Note, the `sys_call_table` is the array that we saw above in this part. As we already know, the `rax` general purpose register contains the number of a system call and each element of the `sys_call_table` is 8 bytes. So we use the `*sys_call_table(, %rax, 8)` notation to find the correct offset in the `sys_call_table` array for the given system call handler. + +That's all.
We have made all the preparations and the system call handler has been called for the requested system call, for example `sys_read`, `sys_write` or another system call handler that is defined with the `SYSCALL_DEFINE[N]` macro in the Linux kernel code. + +Exit from a system call +-------------------------------------------------------------------------------- + +After a system call handler finishes its work, we return to the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S), right after the point where we called the system call handler: + +```assembly +call *sys_call_table(, %rax, 8) +``` + +The next step after we have returned from a system call handler is to put the return value of the handler on the stack. We know that a system call returns its result to the user program in the general purpose `rax` register, so after the system call handler has finished its work we move its value to the stack: + +```assembly +movq %rax, RAX(%rsp) +``` + +in the `RAX` slot. + +After this we can see the call of the `LOCKDEP_SYS_EXIT` macro from the [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h): + +```assembly +LOCKDEP_SYS_EXIT +``` + +The implementation of this macro depends on the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option that allows us to debug locks on the exit from a system call. And again, we will not consider it in this chapter, but will return to it in a separate one. At the end of the `entry_SYSCALL_64` function we restore all general purpose registers except `rcx` and `r11`, because the `rcx` register must contain the return address to the application that made the system call and the `r11` register contains the old [flags register](https://en.wikipedia.org/wiki/FLAGS_register).
After all general purpose registers are restored, we fill `rcx` with the return address, the `r11` register with the flags and `rsp` with the old stack pointer: + +```assembly +RESTORE_C_REGS_EXCEPT_RCX_R11 + +movq RIP(%rsp), %rcx +movq EFLAGS(%rsp), %r11 +movq RSP(%rsp), %rsp + +USERGS_SYSRET64 +``` + +In the end we just call the `USERGS_SYSRET64` macro, which expands to the `swapgs` instruction that exchanges the user `GS` and kernel `GS` again, and the `sysretq` instruction that performs the exit from a system call handler: + +```C +#define USERGS_SYSRET64 \ + swapgs; \ + sysretq; +``` + +Now we know what occurs when a user application makes a system call. The full path of this process is the following: + +* The user application contains code that fills the general purpose registers with values (the system call number and the arguments of this system call); +* The processor switches from user mode to kernel mode and starts execution of the system call entry - `entry_SYSCALL_64`; +* `entry_SYSCALL_64` switches to the kernel stack and saves some general purpose registers, the old stack and code segment, flags, etc. on the stack; +* `entry_SYSCALL_64` checks the system call number in the `rax` register, looks up the system call handler in the `sys_call_table` and calls it if the system call number is correct; +* If the system call number is not correct, it jumps to the exit from the system call; +* After the system call handler finishes its work, it restores the general purpose registers, old stack, flags and return address and exits from `entry_SYSCALL_64` with the `sysretq` instruction. + +That's all. + +Conclusion +-------------------------------------------------------------------------------- + +This is the end of the second part about the system call concept in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) we saw the theory of this concept from the user application's point of view.
In this part we continued to dive into the stuff related to the system call concept and saw what the Linux kernel does when a system call occurs. + +If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-internals/issues/new). + +**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).** + +Links +-------------------------------------------------------------------------------- + +* [system call](https://en.wikipedia.org/wiki/System_call) +* [write](http://man7.org/linux/man-pages/man2/write.2.html) +* [C standard library](https://en.wikipedia.org/wiki/GNU_C_Library) +* [list of cpu architectures](https://en.wikipedia.org/wiki/List_of_CPU_architectures) +* [x86_64](https://en.wikipedia.org/wiki/X86-64) +* [kbuild](https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt) +* [typedef](https://en.wikipedia.org/wiki/Typedef) +* [errno](http://man7.org/linux/man-pages/man3/errno.3.html) +* [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) +* [model specific register](https://en.wikipedia.org/wiki/Model-specific_register) +* [intel 2b manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html) +* [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) +* [instruction pointer](https://en.wikipedia.org/wiki/Program_counter) +* [flags register](https://en.wikipedia.org/wiki/FLAGS_register) +* [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) +* [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) +* [general purpose registers](https://en.wikipedia.org/wiki/Processor_register) +* [ABI](https://en.wikipedia.org/wiki/Application_binary_interface) +* [x86_64 C 
ABI](http://www.x86-64.org/documentation/abi.pdf) +* [previous chapter](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) From 6113643c9880c7319886e0eeed134c4630e870ca Mon Sep 17 00:00:00 2001 From: 0xAX <0xAX@users.noreply.github.com> Date: Sun, 30 Aug 2015 20:01:17 +0600 Subject: [PATCH 12/15] Update SUMMARY.md --- SUMMARY.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SUMMARY.md b/SUMMARY.md index e797e75..ab50749 100644 --- a/SUMMARY.md +++ b/SUMMARY.md @@ -33,7 +33,7 @@ * [Fixmaps and ioremap](mm/linux-mm-2.md) * [System calls](SysCall/README.md) * [Introduction to system calls](SysCall/syscall-1.md) - * [How the Linux kernel handles a system call]() + * [How the Linux kernel handles a system call](SysCall/syscall-2.md) * [SMP]() * [Concepts](Concepts/README.md) * [Per-CPU variables](Concepts/per-cpu.md) From b217838cf6579f11e47c76246f114a0d89807947 Mon Sep 17 00:00:00 2001 From: 0xAX <0xAX@users.noreply.github.com> Date: Sun, 30 Aug 2015 20:02:22 +0600 Subject: [PATCH 13/15] Update README.md --- SysCall/README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/SysCall/README.md b/SysCall/README.md index b8e2294..f877c81 100644 --- a/SysCall/README.md +++ b/SysCall/README.md @@ -4,3 +4,4 @@ This chapter describes the `system call` concept in the linux kernel. You will s couple of posts which describe the full cycle of the kernel loading process: * [Introduction to system call concept](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) - this part is introduction to the `system call` concept in the Linux kernel. +* [How the Linux kernel handles a system call](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-2.html) - this part describes how the Linux kernel handles a system call from an userspace application. 
From 251d184a87bc982dbeff7a9cef38409f1f1c74e5 Mon Sep 17 00:00:00 2001 From: 0xAX <0xAX@users.noreply.github.com> Date: Sun, 30 Aug 2015 20:10:27 +0600 Subject: [PATCH 14/15] Update syscall-2.md --- SysCall/syscall-2.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/SysCall/syscall-2.md b/SysCall/syscall-2.md index a2c6ef1..bafb5c8 100644 --- a/SysCall/syscall-2.md +++ b/SysCall/syscall-2.md @@ -1,7 +1,7 @@ System calls in the Linux kernel. Part 2. ================================================================================ -How the Linux kernel handles a system call +How does the Linux kernel handles a system call -------------------------------------------------------------------------------- The previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) was the first part of the chapter that describes [system call](https://en.wikipedia.org/wiki/System_call) concept in the Linux kernel. We have learned what is it a `system call` in the Linux kernel and in a operating system kernel in general, looked on this concept from the user space and even saw partly implementation of the [write](http://man7.org/linux/man-pages/man2/write.2.html) system call. In this part we will continue dive into this theme and as we usually did it in other chapters of this book - after some theory we will start to sink lower and lower, and go directly to the Linux kernel code. From 0ca9ef9190b02231aa9d0c3630d3990aba92bf7f Mon Sep 17 00:00:00 2001 From: Dave Willmer Date: Sun, 30 Aug 2015 20:02:21 +0100 Subject: [PATCH 15/15] Minor typos and grammatical fixes. --- SysCall/syscall-2.md | 91 ++++++++++++++++++++++---------------------- 1 file changed, 46 insertions(+), 45 deletions(-) diff --git a/SysCall/syscall-2.md b/SysCall/syscall-2.md index bafb5c8..f4ba8c7 100644 --- a/SysCall/syscall-2.md +++ b/SysCall/syscall-2.md @@ -1,12 +1,13 @@ System calls in the Linux kernel. Part 2. 
================================================================================ -How does the Linux kernel handles a system call +How does the Linux kernel handle a system call -------------------------------------------------------------------------------- -The previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) was the first part of the chapter that describes [system call](https://en.wikipedia.org/wiki/System_call) concept in the Linux kernel. We have learned what is it a `system call` in the Linux kernel and in a operating system kernel in general, looked on this concept from the user space and even saw partly implementation of the [write](http://man7.org/linux/man-pages/man2/write.2.html) system call. In this part we will continue dive into this theme and as we usually did it in other chapters of this book - after some theory we will start to sink lower and lower, and go directly to the Linux kernel code. +The previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) was the first part of the chapter that describes the [system call](https://en.wikipedia.org/wiki/System_call) concepts in the Linux kernel. +In the previous part we learned what a system call is in the Linux kernel, and in operating systems in general. This was introduced from a user-space perspective, and part of the [write](http://man7.org/linux/man-pages/man2/write.2.html) system call implementation was discussed. In this part we continue our look at system calls, starting with some theory before moving onto the Linux kernel code. -An user application does not call the system call directly from our applications. We did not write `Hello world!` program like: +An user application does not make the system call directly from our applications. 
We did not write the `Hello world!` program like: ```C int main(int argc, char **argv) @@ -36,12 +37,12 @@ int main(int argc, char **argv) } ``` -But anyway, `write` is not directly system call and not kernel function. An application must fill general purpose registers with the correct values and in the fixed order and call `syscall` instruction to call real system call. In this part we will know, what occurs in the linux kernel when the processor met `syscall` instruction. +But anyway, `write` is not a direct system call and not a kernel function. An application must fill general purpose registers with the correct values in the correct order and use the `syscall` instruction to make the actual system call. In this part we will look at what occurs in the Linux kernel when the `syscall` instruction is met by the processor. Initialization of the system calls table -------------------------------------------------------------------------------- -From the previous part we know that system call concept is very similar to interrupt. Furthermore system calls implemented as software interrupts. So, when the processor handles `syscall` instruction from a user application, this instruction causes an exception which transfers control to an exception handler. As we know, all exception handlers (or in other words kernel [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) functions that will react on a exception) are placed in the kernel code. But how the Linux kernel searches address of the necessary system call handler for the related system call? Linux kernel contains special table which is called - system call table. The system call table represeted by the `sys_call_table` array in the Linux kernel which defined in the [arch/x86/entry/syscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscall_64.c) source code file. Let's look on its implementation: +From the previous part we know that system call concept is very similar to an interrupt. 
Furthermore, system calls are implemented as software interrupts. So, when the processor handles a `syscall` instruction from a user application, this instruction causes an exception which transfers control to an exception handler. As we know, all exception handlers (or in other words kernel [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) functions that will react on an exception) are placed in the kernel code. But how does the Linux kernel search for the address of the necessary system call handler for the related system call? The Linux kernel contains a special table called the `system call table`. The system call table is represented by the `sys_call_table` array in the Linux kernel which is defined in the [arch/x86/entry/syscall_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscall_64.c) source code file. Let's look at its implementation: ```C asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = { @@ -50,19 +51,19 @@ asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = { }; ``` -As we can see, the `sys_call_table` is array of `__NR_syscall_max + 1` size where `__NR_syscall_max` macro represents maximum number of the system calls for the certain [architecture](https://en.wikipedia.org/wiki/List_of_CPU_architectures). This book is about [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, so for our case the `__NR_syscall_max` is `322` and this is correc number for now (when I'm writing this part, Linux kernel version is `4.2.0-rc8+`). We can see this macro in the generated by the [Kbuild](https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt) during kernel compilation header file - `include/generated/asm-offsets.h`: +As we can see, the `sys_call_table` is an array of `__NR_syscall_max + 1` size where the `__NR_syscall_max` macro represents the maximum number of system calls for the given [architecture](https://en.wikipedia.org/wiki/List_of_CPU_architectures). 
This book is about the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture, so for our case the `__NR_syscall_max` is `322` and this is the correct number at the time of writing (current Linux kernel version is `4.2.0-rc8+`). We can see this macro in the header file generated by [Kbuild](https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt) during kernel compilation - `include/generated/asm-offsets.h`: ```C #define __NR_syscall_max 322 ``` -The same number of system calls in the [arch/x86/entry/syscalls/syscall_64.tbl](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L331) for the `x86_64`. The following are two things to which we must turn our attention are type of the `sys_call_table` array and initialization of elements of this array. First of all about type. The `sys_call_ptr_t` represents pointer to a system call table. It defined as [typedef](https://en.wikipedia.org/wiki/Typedef) for a function pointer that returns nothing and and does not take arguments: +There will be the same number of system calls in the [arch/x86/entry/syscalls/syscall_64.tbl](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl#L331) for the `x86_64`. There are two important topics here: the type of the `sys_call_table` array, and the initialization of elements in this array. First of all, the type. The `sys_call_ptr_t` is the type of the entries in the system call table. It is defined as a [typedef](https://en.wikipedia.org/wiki/Typedef) for a function pointer that returns nothing and does not take arguments: ```C typedef void (*sys_call_ptr_t)(void); ``` -The second thing is initialization of the `sys_call_table` array. As we can see in the code above, all elements of our array that contains pointers to the system call handlers point to the `sys_ni_syscall`. The `sys_ni_syscall` function represents not-implemented system calls. 
Yes, for now all elements of the `sys_call_table` array point to the not implemented system calls. But it is only for now, it is correct behaviour, because we only initialize storage of the pointers to the system call handlers, later we will fill it. Implementation of the `sys_ni_syscall` is pretty easy, it just returns [-errno](http://man7.org/linux/man-pages/man3/errno.3.html) or `-ENOSYS` in our case: +The second thing is the initialization of the `sys_call_table` array. As we can see in the code above, all elements of our array that contain pointers to the system call handlers point to the `sys_ni_syscall`. The `sys_ni_syscall` function represents not-implemented system calls. To start with, all elements of the `sys_call_table` array point to the not-implemented system call. This is the correct initial behaviour, because we only initialize storage of the pointers to the system call handlers, it is populated later on. Implementation of the `sys_ni_syscall` is pretty easy, it just returns [-errno](http://man7.org/linux/man-pages/man3/errno.3.html) or `-ENOSYS` in our case: ```C asmlinkage long sys_ni_syscall(void) @@ -71,13 +72,13 @@ asmlinkage long sys_ni_syscall(void) } ``` -The `-ENOSYS` error talks us that: +The `-ENOSYS` error tells us that: ``` ENOSYS Function not implemented (POSIX.1) ``` -Also note on `...` in the initialization of the `sys_call_table`. We can do it with the extension of the [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) which is called - [Designated Initializers](https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html). This extension allows us to initialize elements in non-fixed order. As you can note, we include `asm/syscalls_64.h` header in the end of the array. 
This header file is generated by the special script that placed in the [arch/x86/entry/syscalls/syscalltbl.sh](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscalltbl.sh) and generates our header file from the [syscall table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl). The `asm/syscalls_64.h` contains definitions of the following macros: +Also a note on `...` in the initialization of the `sys_call_table`. We can do it with a [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) compiler extension called - [Designated Initializers](https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html). This extension allows us to initialize elements in non-fixed order. As you can see, we include the `asm/syscalls_64.h` header at the end of the array. This header file is generated by the special script at [arch/x86/entry/syscalls/syscalltbl.sh](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscalltbl.sh) and generates our header file from the [syscall table](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscalls/syscall_64.tbl). The `asm/syscalls_64.h` contains definitions of the following macros: ```C __SYSCALL_COMMON(0, sys_read, sys_read) @@ -90,7 +91,7 @@ __SYSCALL_COMMON(5, sys_newfstat, sys_newfstat) ... 
``` -The `__SYSCALL_COMMON` macro defined in the same source code [file](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscall_64.c) and expands to the `__SYSCALL_64` macro which expands to the function definition: +The `__SYSCALL_COMMON` macro is defined in the same source code [file](https://github.com/torvalds/linux/blob/master/arch/x86/entry/syscall_64.c) and expands to the `__SYSCALL_64` macro which expands to the function definition: ```C #define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat) @@ -111,39 +112,39 @@ asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = { }; ``` -After this all elements that points to the non-implemented system calls will contain address of the `sys_ni_syscall` function that just returns `-ENOSYS` as we saw above and other elements will point to the `sys_syscall_name` functions. +After this all elements that point to the non-implemented system calls will contain the address of the `sys_ni_syscall` function that just returns `-ENOSYS` as we saw above, and other elements will point to the `sys_syscall_name` functions. -For this moment we have already filled system call table and the Linux kernel knows where is the certain system call handler. But the Linux kernel does not call a `sys_syscall_name` function right after it got a control to handle a system call from a user space application. Remember the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about interrupts and interrupt handling. When the Linux kernel gets the control to handle an interrupt, it had to do some preparations like save user space registers, switch to a new stack and many more before it will call an interrupt handler. There is the same situation with the system call handling. 
The preparation before a system call handling is an one thing, but before the Linux kernel will start to do these preparations, the entry point of a system call must be initailized and only than Linux kernel knows where to handle this preparations. In the next paragraph we will see the process of the initialization of the system call entry in the Linux kernel. +At this point, we have filled the system call table and the Linux kernel knows where each system call handler is. But the Linux kernel does not call a `sys_syscall_name` function immediately after it is instructed to handle a system call from a user space application. Remember the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about interrupts and interrupt handling. When the Linux kernel gets control to handle an interrupt, it has to do some preparations like saving user space registers, switching to a new stack and many more tasks before it calls an interrupt handler. The situation is the same with system call handling. The preparation for handling a system call is one thing, but before the Linux kernel starts these preparations, the entry point of a system call must be initialized, and only then does the Linux kernel know where to begin these preparations. In the next section we will see the process of the initialization of the system call entry in the Linux kernel. Initialization of the system call entry -------------------------------------------------------------------------------- -When a system call occurs in the system, where is the first bytes of code that starts to handle it? As we can read in the intel manual - [64-ia-32-architectures-software-developer-vol-2b-manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html): +When a system call occurs in the system, where are the first bytes of code that start to handle it? 
As we can read in the Intel manual - [64-ia-32-architectures-software-developer-vol-2b-manual](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html): ``` SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR MSR ``` -it means that we need to put system call entry to the `IA32_LSTAR` [model specific register](https://en.wikipedia.org/wiki/Model-specific_register). This operation takes place during Linux kernel initialization process. If you have read the fourth [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-4.html) of the chapter that describes interrupts and interrupt handling in the Linux kernel, you know that Linux kernel calls `trap_init` function during the initialization process. This function defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and executes initialization of the `non-early` exceptions handlres like divide error, [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) error and etc. Besides the initialization of the `non-early` exceptions handlers, this function calls the `cpu_init` function from the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/blob/arch/x86/kernel/cpu/common.c) source code file which besides initialization of `per-cpu` state, calls the `syscall_init` function from the same source code file. +it means that we need to put the system call entry in to the `IA32_LSTAR` [model specific register](https://en.wikipedia.org/wiki/Model-specific_register). This operation takes place during the Linux kernel initialization process. 
If you have read the fourth [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-4.html) of the chapter that describes interrupts and interrupt handling in the Linux kernel, you know that the Linux kernel calls the `trap_init` function during the initialization process. This function is defined in the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup.c) source code file and executes the initialization of the `non-early` exception handlers like divide error, [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) error etc. Besides the initialization of the `non-early` exceptions handlers, this function calls the `cpu_init` function from the [arch/x86/kernel/cpu/common.c](https://github.com/torvalds/linux/blob/master/blob/arch/x86/kernel/cpu/common.c) source code file which besides initialization of `per-cpu` state, calls the `syscall_init` function from the same source code file. -That's just this function does all work for us by initialization of the system call entry point. Let's look on the implementation of this function. It does not take parameters and first of all fills two model specific registers: +This function performs the initialization of the system call entry point. Let's look on the implementation of this function. It does not take parameters and first of all it fills two model specific registers: ```C wrmsrl(MSR_STAR, ((u64)__USER32_CS)<<48 | ((u64)__KERNEL_CS)<<32); wrmsrl(MSR_LSTAR, entry_SYSCALL_64); ``` -The first model specific register - `MSR_STAR` contains `63:48` bits of the user code segmet. This bits will be loaded to the `CS` and `SS` segment registers for the `sysret` instruction which provides functionality to return from a system call to user code with the related privilege. Also the `MSR_STAR` contains `47:32` bits from the kernel code that will be used as the base selector for `CS` and `SS` segment registers when user space application will execut a system call. 
In the second line of code we fill the `MSR_LSTAR` register with the `entry_SYSCALL_64` symbol that represents system call entry. The `entry_SYSCALL_64` defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and contains code related to the preparation before a handler of a system call will be executed (I already wrote about these preparations, read above). We will not consider the `entry_SYSCALL_64` now, but will return to it later in this chapter. +The first model specific register - `MSR_STAR` holds the user code segment selector in bits `63:48`. These bits will be loaded to the `CS` and `SS` segment registers for the `sysret` instruction which provides functionality to return from a system call to user code with the related privilege. Also the `MSR_STAR` holds the kernel code segment selector in bits `47:32`, which will be used as the base selector for the `CS` and `SS` segment registers when user space applications execute a system call. In the second line of code we fill the `MSR_LSTAR` register with the `entry_SYSCALL_64` symbol that represents the system call entry. The `entry_SYSCALL_64` is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and contains code related to the preparation performed before a system call handler will be executed (I already wrote about these preparations, read above). We will not consider the `entry_SYSCALL_64` now, but will return to it later in this chapter. 
-After we have set the entry point for system calls, we need to set following model specific registers: +After we have set the entry point for system calls, we need to set the following model specific registers: * `MSR_CSTAR` - target `rip` for the compability mode callers; * `MSR_IA32_SYSENTER_CS` - target `cs` for the `sysenter` instruction; * `MSR_IA32_SYSENTER_ESP` - target `esp` for the `sysenter` instruction; * `MSR_IA32_SYSENTER_EIP` - target `eip` for the `sysenter` instruction. -Values of these model specific register depends on the `CONFIG_IA32_EMULATION` kernel configuration option. If this kernel configuration option is enabled, it allows to run legacy 32-bit programs under a 64-bit kernel. In the first case, if the `CONFIG_IA32_EMULATION` kernel configuration option is enabled, we fill these model specific registers with the entry point for the system calls the compability mode: +The values of these model specific register depend on the `CONFIG_IA32_EMULATION` kernel configuration option. If this kernel configuration option is enabled, it allows legacy 32-bit programs to run under a 64-bit kernel. 
In the first case, if the `CONFIG_IA32_EMULATION` kernel configuration option is enabled, we fill these model specific registers with the entry point for the system calls the compability mode: ```C wrmsrl(MSR_CSTAR, entry_SYSCALL_compat); @@ -163,7 +164,7 @@ In another way, if the `CONFIG_IA32_EMULATION` kernel configuration option is di wrmsrl(MSR_CSTAR, ignore_sysret); ``` -that defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and just returns `-ENOSYS` error code: +that is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and just returns `-ENOSYS` error code: ```assembly ENTRY(ignore_sysret) @@ -172,7 +173,7 @@ ENTRY(ignore_sysret) END(ignore_sysret) ``` -Now we need to fill `MSR_IA32_SYSENTER_CS`, `MSR_IA32_SYSENTER_ESP`, `MSR_IA32_SYSENTER_EIP` model specific registers as we did it in the previous code when the `CONFIG_IA32_EMULATION` kernel configuration option was enabled. In this case (when the `CONFIG_IA32_EMULATION` configuration option is not set) we fill the `MSR_IA32_SYSENTER_ESP` and the `MSR_IA32_SYSENTER_EIP` with zero and put invalid segment of the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) to the `MSR_IA32_SYSENTER_CS` model specific register: +Now we need to fill `MSR_IA32_SYSENTER_CS`, `MSR_IA32_SYSENTER_ESP`, `MSR_IA32_SYSENTER_EIP` model specific registers as we did in the previous code when the `CONFIG_IA32_EMULATION` kernel configuration option was enabled. 
In this case (when the `CONFIG_IA32_EMULATION` configuration option is not set) we fill the `MSR_IA32_SYSENTER_ESP` and the `MSR_IA32_SYSENTER_EIP` with zero and put the invalid segment of the [Global Descriptor Table](https://en.wikipedia.org/wiki/Global_Descriptor_Table) to the `MSR_IA32_SYSENTER_CS` model specific register: ```C wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG); @@ -180,9 +181,9 @@ wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL); wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL); ``` -More about the `Global Descriptor Table` you can read in the second [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) of the chapter that describes booting process of the Linux kernel. +You can read more about the `Global Descriptor Table` in the second [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-2.html) of the chapter that describes the booting process of the Linux kernel. -In the end of the `syscall_init` function, we just mask flags in the [flags register](https://en.wikipedia.org/wiki/FLAGS_register) by the writing set of flags to the `MSR_SYSCALL_MASK` model specific register: +At the end of the `syscall_init` function, we just mask flags in the [flags register](https://en.wikipedia.org/wiki/FLAGS_register) by writing the set of flags to the `MSR_SYSCALL_MASK` model specific register: ```C wrmsrl(MSR_SYSCALL_MASK, @@ -190,26 +191,26 @@ wrmsrl(MSR_SYSCALL_MASK, X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT); ``` -These flags will be cleared during syscall initialization. That's all, it is the end of the `syscall_init` function and it means that entry for system calls is ready to work. Now we can see what will occur when an user application executes the `syscall` instruction. +These flags will be cleared each time the `syscall` instruction is executed. That's all, it is the end of the `syscall_init` function and it means that the system call entry is ready to work. 
Now we can see what will occur when an user application executes the `syscall` instruction. Preparation before system call handler will be called -------------------------------------------------------------------------------- -As I already wrote before a system call or an interrupt handler will be called by the Linux kernel, need to do some preparations. The `idtentry` macro does these preparations before an exception handler will be executed, the `interrupt` macro does these preparations before an interrupt handler will be called and the `entry_SYSCALL_64` will do these preparations before a system call handler will be executed. +As I already wrote, before a system call or an interrupt handler will be called by the Linux kernel we need to do some preparations. The `idtentry` macro performs the preparations required before an exception handler will be executed, the `interrupt` macro performs the preparations requires before an interrupt handler will be called and the `entry_SYSCALL_64` will do the preparations required before a system call handler will be executed. 
-The `entry_SYSCALL_64` defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and starts from the following macro:
+The `entry_SYSCALL_64` is defined in the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S) assembly file and starts from the following macro:

 ```assembly
 SWAPGS_UNSAFE_STACK
 ```

-This macro defined in the [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h) header file and expands to the `swapgs` instruction:
+This macro is defined in the [arch/x86/include/asm/irqflags.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/irqflags.h) header file and expands to the `swapgs` instruction:

 ```C
 #define SWAPGS_UNSAFE_STACK swapgs
 ```

-which is exchanges the current GS base register value with the value contained in the `MSR_KERNEL_GS_BASE ` model specific register. In other words we moved on the kernel stack. After this we put old stack pointer to the `rsp_scratch` [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable and setup stack pointer to the top of stack for the current processor:
+which exchanges the current GS base register value with the value contained in the `MSR_KERNEL_GS_BASE ` model specific register. In other words we moved it on to the kernel stack. After this we point the old stack pointer to the `rsp_scratch` [per-cpu](http://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) variable and setup the stack pointer to point to the top of stack for the current processor:

 ```assembly
 movq %rsp, PER_CPU_VAR(rsp_scratch)
@@ -223,7 +224,7 @@ pushq $__USER_DS
 pushq PER_CPU_VAR(rsp_scratch)
 ```

-After this we enable interrupts, because interrupts are off on entry and save general purpose [registers](https://en.wikipedia.org/wiki/Processor_register) (besides `bp`, `bx` and from `r12` to `r15`), flags, `-ENOSYS` for the non-implemented system call and code segment register on the stack:
+After this we enable interrupts, because interrupts are `off` on entry and save the general purpose [registers](https://en.wikipedia.org/wiki/Processor_register) (besides `bp`, `bx` and from `r12` to `r15`), flags, `-ENOSYS` for the non-implemented system call and code segment register on the stack:

 ```assembly
 ENABLE_INTERRUPTS(CLBR_NONE)
@@ -256,7 +257,7 @@ When a system call occurs from the user's application, general purpose registers
 * `r8` - contains fifth argument of a system call handler;
 * `r9` - contains sixth argument of a system call handler;

-Other general purpose registers (as `rbp`, `rbx` and from `r12` to `r15`) are callee-preserved in [C ABI](http://www.x86-64.org/documentation/abi.pdf)). So we push register flags on top of the stack, then user code segment, return address to the user space, system call number, first three arguments, dump error code for the non-implemented system call and other arguments on the stack.
+Other general purpose registers (as `rbp`, `rbx` and from `r12` to `r15`) are callee-preserved in [C ABI](http://www.x86-64.org/documentation/abi.pdf)). So we push register flags on the top of the stack, then user code segment, return address to the user space, system call number, first three arguments, dump error code for the non-implemented system call and other arguments on the stack.

 In the next step we check the `_TIF_WORK_SYSCALL_ENTRY` in the current `thread_info`:

@@ -265,7 +266,7 @@ testl $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
 jnz tracesys
 ```

-The `_TIF_WORK_SYSCALL_ENTRY` macro defined in the [arch/x86/include/asm/thread_info.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/thread_info.h) header file and provides set of the thread information flags that are related to the system calls tracing:
+The `_TIF_WORK_SYSCALL_ENTRY` macro is defined in the [arch/x86/include/asm/thread_info.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/thread_info.h) header file and provides set of the thread information flags that are related to the system calls tracing:

 ```C
 #define _TIF_WORK_SYSCALL_ENTRY \
@@ -274,7 +275,7 @@ The `_TIF_WORK_SYSCALL_ENTRY` macro defined in the [arch/x86/include/asm/thread_
 _TIF_NOHZ)
 ```

-We will not consider debugging/tracing related stuff in this chapter, but will see it in the separate chapter that will devoted to the debugging and tracing technics in the Linux kernel. As we did not just on the `tracesys` label, the next label is the `entry_SYSCALL_64_fastpath`. In the `entry_SYSCALL_64_fastpath` we check the `__SYSCALL_MASK` that defined in the [arch/x86/include/asm/unistd.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/unistd.h) header file and
+We will not consider debugging/tracing related stuff in this chapter, but will see it in the separate chapter that will be devoted to the debugging and tracing techniques in the Linux kernel. After the `tracesys` label, the next label is the `entry_SYSCALL_64_fastpath`. In the `entry_SYSCALL_64_fastpath` we check the `__SYSCALL_MASK` that is defined in the [arch/x86/include/asm/unistd.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/unistd.h) header file and

 ```C
 # ifdef CONFIG_X86_X32_ABI
@@ -290,9 +291,9 @@ where the `__X32_SYSCALL_BIT` is
 #define __X32_SYSCALL_BIT 0x40000000
 ```

-As we can see the `__SYSCALL_MASK` depends on the `CONFIG_X86_X32_ABI` kernel configuration option and represents mask for the 32-bit [ABI](https://en.wikipedia.org/wiki/Application_binary_interface) in the 64-bit kernel.
+As we can see the `__SYSCALL_MASK` depends on the `CONFIG_X86_X32_ABI` kernel configuration option and represents the mask for the 32-bit [ABI](https://en.wikipedia.org/wiki/Application_binary_interface) in the 64-bit kernel.

-So we check the value of the `__SYSCALL_MASK` and if the `CONFIG_X86_X32_ABI` is disabled we compare the value of the `rax` register the the maximum syscall number (`__NR_syscall_max`), in another way if the `CNOFIG_X86_X32_ABI` is enabled we mask `eax` register with the `__X32_SYSCALL_BIT` and do the same comparison:
+So we check the value of the `__SYSCALL_MASK` and if the `CONFIG_X86_X32_ABI` is disabled we compare the value of the `rax` register to the maximum syscall number (`__NR_syscall_max`), alternatively if the `CNOFIG_X86_X32_ABI` is enabled we mask the `eax` register with the `__X32_SYSCALL_BIT` and do the same comparison:

 ```assembly
 #if __SYSCALL_MASK == ~0
@@ -303,33 +304,33 @@ So we check the value of the `__SYSCALL_MASK` and if the `CONFIG_X86_X32_ABI` is
 #endif
 ```

-After this we check the result of the last comparison with the `ja` instruction that executes if `CF` an `ZF` flags are zero:
+After this we check the result of the last comparison with the `ja` instruction that executes if `CF` and `ZF` flags are zero:

 ```assembly
 ja 1f
 ```

-and if we have correct system call for this, we move fourth argument from the `r10` to the `rcx` to keep [x86_64 C ABI](http://www.x86-64.org/documentation/abi.pdf) and execute `call` instruction with the address of a system call handler:
+and if we have the correct system call for this, we move the fourth argument from the `r10` to the `rcx` to keep [x86_64 C ABI](http://www.x86-64.org/documentation/abi.pdf) compliant and execute the `call` instruction with the address of a system call handler:

 ```assembly
 movq %r10, %rcx
 call *sys_call_table(, %rax, 8)
 ```

-Note, the `sys_call_table` is an array that we saw above in this part. As we already know the `rax` general purpose register contains number of a system call and each element of the `sys_call_table` is 8-bytes. So we are using `*sys_call_table(, %rax, 8)` this notation to find correct offset in the `sys_call_table` array for the certain system call handler.
+Note, the `sys_call_table` is an array that we saw above in this part. As we already know the `rax` general purpose register contains the number of a system call and each element of the `sys_call_table` is 8-bytes. So we are using `*sys_call_table(, %rax, 8)` this notation to find the correct offset in the `sys_call_table` array for the given system call handler.

-That's all. We did all preparations and the system call handler was called for the certain interrupt handler, for example `sys_read`, `sys_write` or other system call handler that defined with the `SYSCALL_DEFINE[N]` macro in the Linux kernel code.
+That's all. We did all the required preparations and the system call handler was called for the given interrupt handler, for example `sys_read`, `sys_write` or other system call handler that is defined with the `SYSCALL_DEFINE[N]` macro in the Linux kernel code.

 Exit from a system call
 --------------------------------------------------------------------------------

-After a system call handler will finish its work, we will return back to the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S), right after where we have called a system call handler:
+After a system call handler finishes its work, we will return back to the [arch/x86/entry/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S), right after where we have called the system call handler:

 ```assembly
 call *sys_call_table(, %rax, 8)
 ```

-The next step as we've returned from a system call handler is to put return value of a system handler to the stack. We know that a system call returns result to the user program in the general purpose `rax` register, so we are moving its value after system call handler have finished its work to the stack:
+The next step after we've returned from a system call handler is to put the return value of a system handler on to the stack. We know that a system call returns the result to the user program in the general purpose `rax` register, so we are moving its value on to the stack after the system call handler has finished its work:

 ```C
 movq %rax, RAX(%rsp)
@@ -343,7 +344,7 @@ After this we can see the call of the `LOCKDEP_SYS_EXIT` macro from the [arch/x8
 LOCKDEP_SYS_EXIT
 ```

-Implementation of this macro depends on the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option that allows us to debug locks on the exit from a system call. And again, we will not consider it in this chapter, but will return to it in the separate. In the end of the `entry_SYSCALL_64` function we restore all general purpose registers besides `rxc` and `r11`, because the `rcx` register must contain return address to the application that called system call and the `r11` register contains old [flags register](https://en.wikipedia.org/wiki/FLAGS_register). After all general purpose registers are restored, we fill `rcx` with the return address, `r11` register with the falgs and `rsp` with the old stack pointer:
+The implementation of this macro depends on the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option that allows us to debug locks on exit from a system call. And again, we will not consider it in this chapter, but will return to it in a separate one. In the end of the `entry_SYSCALL_64` function we restore all general purpose registers besides `rxc` and `r11`, because the `rcx` register must contain the return address to the application that called system call and the `r11` register contains the old [flags register](https://en.wikipedia.org/wiki/FLAGS_register). After all general purpose registers are restored, we fill `rcx` with the return address, `r11` register with the flags and `rsp` with the old stack pointer:

 ```assembly
 RESTORE_C_REGS_EXCEPT_RCX_R11
@@ -355,7 +356,7 @@ movq RSP(%rsp), %rsp
 USERGS_SYSRET64
 ```

-In the end we just call the `USERGS_SYSRET64` macro that expands to the call of the `swapgs` instruction which exchanges again user `GS` and kernel `GS` and the `sysretq` instruction which executes exit from a system call handler:
+In the end we just call the `USERGS_SYSRET64` macro that expands to the call of the `swapgs` instruction which exchanges again the user `GS` and kernel `GS` and the `sysretq` instruction which executes on exit from a system call handler:

 ```C
 #define USERGS_SYSRET64 \
@@ -363,12 +364,12 @@ In the end we just call the `USERGS_SYSRET64` macro that expands to the call of
 sysretq;
 ```

-Now we know what occurs when an user application calls a system call. Full path of this process is following:
+Now we know what occurs when an user application calls a system call. The full path of this process is as follows:

 * User application contains code that fills general purposer register with the values (system call number and arguments of this system call);
 * Processor switches from the user mode to kernel mode and starts execution of the system call entry - `entry_SYSCALL_64`;
 * `entry_SYSCALL_64` switches to the kernel stack and saves some general purpose registers, old stack and code segment, flags and etc... on the stack;
-* `entry_SYSCALL_64` checks system call number in the `rax` register, searches a system call handler in the `sys_call_table` and calls it, if the number of a system call is correct;
+* `entry_SYSCALL_64` checks the system call number in the `rax` register, searches a system call handler in the `sys_call_table` and calls it, if the number of a system call is correct;
 * If a system call is not correct, jump on exit from system call;
 * After a system call handler will finish its work, restore general purposer registers, old stack, flags and return address and exit from the `entry_SYSCALL_64` with the `sysretq` instruction.

@@ -377,7 +378,7 @@ That's all.
 Conclusion
 --------------------------------------------------------------------------------

-This is the end of the second part about the system calls concept in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) we saw theory about this concept from the user application view. In this part we continued to dive into the stuf which is related to the system call concept and saw what Linux kernel does when a system call occurs.
+This is the end of the second part about the system calls concept in the Linux kernel. In the previous [part](http://0xax.gitbooks.io/linux-insides/content/SysCall/syscall-1.html) we saw theory about this concept from the user application view. In this part we continued to dive into the stuff which is related to the system call concept and saw what the Linux kernel does when a system call occurs.

 If you have questions or suggestions, feel free to ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-internals/issues/new).