mirror of
https://github.com/0xAX/linux-insides.git
synced 2025-01-08 23:01:05 +00:00
commit
f4a0f1b918
@ -388,7 +388,7 @@ This can lead to 3 different scenarios:
|
|||||||
|
|
||||||
Let's look at all three of these scenarios:
|
Let's look at all three of these scenarios:
|
||||||
|
|
||||||
1. `ss` has a correct address (0x10000). In this case we go to label [2](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L481):
|
* `ss` has a correct address (0x10000). In this case we go to label [2](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L481):
|
||||||
|
|
||||||
```
|
```
|
||||||
2: andw $~3, %dx
|
2: andw $~3, %dx
|
||||||
@ -403,7 +403,7 @@ Here we can see the alignment of `dx` (contains `sp` given by bootloader) to 4 b
|
|||||||
|
|
||||||
![stack](http://oi58.tinypic.com/16iwcis.jpg)
|
![stack](http://oi58.tinypic.com/16iwcis.jpg)
|
||||||
|
|
||||||
2. In the second scenario, (`ss` != `ds`). First of all put the [_end](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L52) (address of end of setup code) value in `dx` and check the `loadflags` header field with the `testb` instruction to see whether we can use the heap or not. [loadflags](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L321) is a bitmask header which is defined as:
|
* In the second scenario, (`ss` != `ds`). First of all put the [_end](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld#L52) (address of end of setup code) value in `dx` and check the `loadflags` header field with the `testb` instruction to see whether we can use the heap or not. [loadflags](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S#L321) is a bitmask header which is defined as:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
#define LOADED_HIGH (1<<0)
|
#define LOADED_HIGH (1<<0)
|
||||||
@ -429,7 +429,7 @@ If the `CAN_USE_HEAP` bit is set, put `heap_end_ptr` in `dx` which points to `_e
|
|||||||
|
|
||||||
![stack](http://oi62.tinypic.com/dr7b5w.jpg)
|
![stack](http://oi62.tinypic.com/dr7b5w.jpg)
|
||||||
|
|
||||||
3. When `CAN_USE_HEAP` is not set, we just use a minimal stack from `_end` to `_end + STACK_SIZE`:
|
* When `CAN_USE_HEAP` is not set, we just use a minimal stack from `_end` to `_end + STACK_SIZE`:
|
||||||
|
|
||||||
![minimal stack](http://oi60.tinypic.com/28w051y.jpg)
|
![minimal stack](http://oi60.tinypic.com/28w051y.jpg)
|
||||||
|
|
||||||
|
@ -389,7 +389,7 @@ Now we can build the top level page table - `PML4` - with:
|
|||||||
|
|
||||||
Here we get the address stored in the `ebx` with `pgtable` offset and put it in `edi`. Next we put this address with offset `0x1007` in the `eax` register. `0x1007` is 4096 bytes (size of the PML4) + 7 (PML4 entry flags - `PRESENT+RW+USER`) and puts `eax` in `edi`. After this manipulation `edi` will contain the address of the first Page Directory Pointer Entry with flags - `PRESENT+RW+USER`.
|
Here we get the address stored in the `ebx` with `pgtable` offset and put it in `edi`. Next we put this address with offset `0x1007` in the `eax` register. `0x1007` is 4096 bytes (size of the PML4) + 7 (PML4 entry flags - `PRESENT+RW+USER`) and puts `eax` in `edi`. After this manipulation `edi` will contain the address of the first Page Directory Pointer Entry with flags - `PRESENT+RW+USER`.
|
||||||
|
|
||||||
In the next step we build 4 Page Directory entries in the Page Directory Pointer table, where the first entry will be with `0x7` flags and the others with `0x8`:
|
In the next step we build 4 Page Directory entries in the Page Directory Pointer table with `0x7` flags or present, write, userspace (`PRESENT WRITE | USER`):
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
leal pgtable + 0x1000(%ebx), %edi
|
leal pgtable + 0x1000(%ebx), %edi
|
||||||
@ -404,9 +404,9 @@ In the next step we build 4 Page Directory entries in the Page Directory Pointer
|
|||||||
|
|
||||||
We put the base address of the page directory pointer table in `edi` and the address of the first page directory pointer entry in `eax`. Put `4` in the `ecx` register, it will be a counter in the following loop and write the address of the first page directory pointer table entry to the `edi` register.
|
We put the base address of the page directory pointer table in `edi` and the address of the first page directory pointer entry in `eax`. Put `4` in the `ecx` register, it will be a counter in the following loop and write the address of the first page directory pointer table entry to the `edi` register.
|
||||||
|
|
||||||
After this `edi` will contain the address of the first page directory pointer entry with flags `0x7`. Next we just calculate the address of following page directory pointer entries with flags `0x8` and write their addresses to `edi`.
|
After this `edi` will contain the address of the first page directory pointer entry with flags `0x7`. Next we just calculate the address of following page directory pointer entries where each entry is 8 bytes, and write their addresses to `eax`.
|
||||||
|
|
||||||
The next step is building the `2048` page table entries by 2 megabytes:
|
The next step is building the `2048` page table entries with 2-MByte page:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
leal pgtable + 0x2000(%ebx), %edi
|
leal pgtable + 0x2000(%ebx), %edi
|
||||||
@ -419,7 +419,7 @@ The next step is building the `2048` page table entries by 2 megabytes:
|
|||||||
jnz 1b
|
jnz 1b
|
||||||
```
|
```
|
||||||
|
|
||||||
Here we do almost the same as in the previous example, except the first entry will be with flags - `$0x00000183` - `PRESENT + WRITE + MBZ` and all other entries with `0x8`. In the end we will have 2048 pages by 2 megabytes.
|
Here we do almost the same as in the previous example, all entries will be with flags - `$0x00000183` - `PRESENT + WRITE + MBZ`. In the end we will have 2048 pages with 2-MByte page.
|
||||||
|
|
||||||
Our early page table structure are done, it maps 4 gigabytes of memory and now we can put the address of the high-level page table - `PML4` - in `cr3` control register:
|
Our early page table structure are done, it maps 4 gigabytes of memory and now we can put the address of the high-level page table - `PML4` - in `cr3` control register:
|
||||||
|
|
||||||
|
@ -129,7 +129,7 @@ We pass `irq_stack_union` symbol to the `INIT_PER_CPU_VAR` macro which just conc
|
|||||||
INIT_PER_CPU(irq_stack_union);
|
INIT_PER_CPU(irq_stack_union);
|
||||||
```
|
```
|
||||||
|
|
||||||
It tells us that the address of the `init_per_cpu__irq_stack_union` will be `irq_stack_union + __per_cpu_load`. Now we need to understand where `init_per_cpu__irq_stack_union` and `__per_cpu_load` are and what they mean. The first `irq_stack_union` is defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h) with the `DECLARE_INIT_PER_CPU` macro which expands to call the `init_per_cpu_var` macro:
|
It tells us that the address of the `init_per_cpu__irq_stack_union` will be `irq_stack_union + __per_cpu_load`. Now we need to understand where `init_per_cpu__irq_stack_union` and `__per_cpu_load` are what they mean. The first `irq_stack_union` is defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h) with the `DECLARE_INIT_PER_CPU` macro which expands to call the `init_per_cpu_var` macro:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
DECLARE_INIT_PER_CPU(irq_stack_union);
|
DECLARE_INIT_PER_CPU(irq_stack_union);
|
||||||
@ -266,7 +266,7 @@ union irq_stack_union {
|
|||||||
};
|
};
|
||||||
```
|
```
|
||||||
|
|
||||||
which defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h). We know that [unioun](http://en.wikipedia.org/wiki/Union_type) in the [C](http://en.wikipedia.org/wiki/C_%28programming_language%29) programming language is a data structure which stores only one field in a memory. We can see here that structure has first field - `gs_base` which is 40 bytes size and represents bottom of the `irq_stack`. So, after this our check with the `BUILD_BUG_ON` macro should end successfully. (you can read the first part about Linux kernel initialization [process](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) if you're interesting about the `BUILD_BUG_ON` macro).
|
which defined in the [arch/x86/include/asm/processor.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/processor.h). We know that [union](http://en.wikipedia.org/wiki/Union_type) in the [C](http://en.wikipedia.org/wiki/C_%28programming_language%29) programming language is a data structure which stores only one field in a memory. We can see here that structure has first field - `gs_base` which is 40 bytes size and represents bottom of the `irq_stack`. So, after this our check with the `BUILD_BUG_ON` macro should end successfully. (you can read the first part about Linux kernel initialization [process](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html) if you're interesting about the `BUILD_BUG_ON` macro).
|
||||||
|
|
||||||
After this we calculate new `canary` value based on the random number and [Time Stamp Counter](http://en.wikipedia.org/wiki/Time_Stamp_Counter):
|
After this we calculate new `canary` value based on the random number and [Time Stamp Counter](http://en.wikipedia.org/wiki/Time_Stamp_Counter):
|
||||||
|
|
||||||
@ -304,7 +304,7 @@ This macro defined in the [include/linux/irqflags.h](https://github.com/torvalds
|
|||||||
#endif
|
#endif
|
||||||
```
|
```
|
||||||
|
|
||||||
They are both similar and as you can see have only one difference: the `local_irq_disable` macro contains call of the `trace_hardirqs_off` when `CONFIG_TRACE_IRQFLAGS_SUPPORT` is enabled. There is special feature in the [lockdep](http://lwn.net/Articles/321663/) subsystem - `irq-flags tracing` for tracing `hardirq` and `stoftirq` state. In ourcase `lockdep` subsytem can give us interesting information about hard/soft irqs on/off events which are occurs in the system. The `trace_hardirqs_off` function defined in the [kernel/locking/lockdep.c](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep.c):
|
They are both similar and as you can see have only one difference: the `local_irq_disable` macro contains call of the `trace_hardirqs_off` when `CONFIG_TRACE_IRQFLAGS_SUPPORT` is enabled. There is special feature in the [lockdep](http://lwn.net/Articles/321663/) subsystem - `irq-flags tracing` for tracing `hardirq` and `stoftirq` state. In our case `lockdep` subsytem can give us interesting information about hard/soft irqs on/off events which are occurs in the system. The `trace_hardirqs_off` function defined in the [kernel/locking/lockdep.c](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep.c):
|
||||||
|
|
||||||
```C
|
```C
|
||||||
void trace_hardirqs_off(void)
|
void trace_hardirqs_off(void)
|
||||||
@ -314,7 +314,7 @@ void trace_hardirqs_off(void)
|
|||||||
EXPORT_SYMBOL(trace_hardirqs_off);
|
EXPORT_SYMBOL(trace_hardirqs_off);
|
||||||
```
|
```
|
||||||
|
|
||||||
and just calls `trace_hardirqs_off_caller` function. The `trace_hardirqs_off_caller` checks the `hardirqs_enabled` filed of the current process increment the `redundant_hardirqs_off` if call of the `local_irq_disable` was redundant or the `hardirqs_off_events` if it was not. These two fields and other `lockdep` statistic related fields are defined in the [kernel/locking/lockdep_insides.h](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep_insides.h) and located in the `lockdep_stats` structure:
|
and just calls `trace_hardirqs_off_caller` function. The `trace_hardirqs_off_caller` checks the `hardirqs_enabled` field of the current process and increases the `redundant_hardirqs_off` if call of the `local_irq_disable` was redundant or the `hardirqs_off_events` if it was not. These two fields and other `lockdep` statistic related fields are defined in the [kernel/locking/lockdep_insides.h](https://github.com/torvalds/linux/blob/master/kernel/locking/lockdep_insides.h) and located in the `lockdep_stats` structure:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
struct lockdep_stats {
|
struct lockdep_stats {
|
||||||
@ -371,7 +371,7 @@ static inline void native_irq_disable(void)
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
And you already must remember that `cli` instruction clears the [IF](http://en.wikipedia.org/wiki/Interrupt_flag) flag which determines ability of a processor to handle and interrupt or an exception. Besides the `local_irq_disable`, as you already can know there is an inverse macr - `local_irq_enable`. This macro has the same tracing mechanism and very similar on the `local_irq_enable`, but as you can understand from its name, it enables interrupts with the `sti` instruction:
|
And you already must remember that `cli` instruction clears the [IF](http://en.wikipedia.org/wiki/Interrupt_flag) flag which determines ability of a processor to handle an interrupt or an exception. Besides the `local_irq_disable`, as you already can know there is an inverse macro - `local_irq_enable`. This macro has the same tracing mechanism and very similar on the `local_irq_enable`, but as you can understand from its name, it enables interrupts with the `sti` instruction:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
static inline void native_irq_enable(void)
|
static inline void native_irq_enable(void)
|
||||||
|
@ -4,7 +4,7 @@ Interrupts and Interrupt Handling. Part 3.
|
|||||||
Interrupt handlers
|
Interrupt handlers
|
||||||
--------------------------------------------------------------------------------
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
This is the third part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about an interrupts and an exceptions handling and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we stoped in the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blame/master/arch/x86/kernel/setup.c) on the setting of the two exceptions handlers for the two following exceptions:
|
This is the third part of the [chapter](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) about an interrupts and an exceptions handling and in the previous [part](http://0xax.gitbooks.io/linux-insides/content/interrupts/index.html) we stopped in the `setup_arch` function from the [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blame/master/arch/x86/kernel/setup.c) on the setting of the two exceptions handlers for the two following exceptions:
|
||||||
|
|
||||||
* `#DB` - debug exception, transfers control from the interrupted process to the debug handler;
|
* `#DB` - debug exception, transfers control from the interrupted process to the debug handler;
|
||||||
* `#BP` - breakpoint exception, caused by the `int 3` instruction.
|
* `#BP` - breakpoint exception, caused by the `int 3` instruction.
|
||||||
@ -104,14 +104,14 @@ As you can note, the `set_intr_gate_ist` and `set_system_intr_gate_ist` function
|
|||||||
* `&debug`;
|
* `&debug`;
|
||||||
* `&int3`.
|
* `&int3`.
|
||||||
|
|
||||||
You will not find these functions in the C code. All that can be found in in the `*.c/*.h` files only definition of this functions in the [arch/x86/include/asm/traps.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/traps.h):
|
You will not find these functions in the C code. All that can be found in the `*.c/*.h` files only definition of this functions in the [arch/x86/include/asm/traps.h](https://github.com/torvalds/linux/tree/master/arch/x86/include/asm/traps.h):
|
||||||
|
|
||||||
```C
|
```C
|
||||||
asmlinkage void debug(void);
|
asmlinkage void debug(void);
|
||||||
asmlinkage void int3(void);
|
asmlinkage void int3(void);
|
||||||
```
|
```
|
||||||
|
|
||||||
But we can see `asmlinkage` descriptor here. The `asmlinkage` is the special specificator of the [gcc](http://en.wikipedia.org/wiki/GNU_Compiler_Collection). Actually for a `C` functions which are will be called from assembly, we need in explicit declaration of the function calling convention. In our case, if function maked with `asmlinkage` descriptor, then `gcc` will compile the function to retrieve parameters from stack. So, both handlers are defined in the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly source code file with the `idtentry` macro:
|
But we can see `asmlinkage` descriptor here. The `asmlinkage` is the special specificator of the [gcc](http://en.wikipedia.org/wiki/GNU_Compiler_Collection). Actually for a `C` functions which are called from assembly, we need in explicit declaration of the function calling convention. In our case, if function maked with `asmlinkage` descriptor, then `gcc` will compile the function to retrieve parameters from stack. So, both handlers are defined in the [arch/x86/kernel/entry_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/entry_64.S) assembly source code file with the `idtentry` macro:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
|
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
|
||||||
@ -199,7 +199,7 @@ The `pushq_cfi` macro defined in the [arch/x86/include/asm/dwarf2.h](https://git
|
|||||||
.endm
|
.endm
|
||||||
```
|
```
|
||||||
|
|
||||||
Pay attention on the `$-1`. We already know that when an exception occrus, the processor pushes `ss`, `rsp`, `rflags`, `cs` and `rip` on the stack:
|
Pay attention on the `$-1`. We already know that when an exception occurs, the processor pushes `ss`, `rsp`, `rflags`, `cs` and `rip` on the stack:
|
||||||
|
|
||||||
```C
|
```C
|
||||||
#define RIP 16*8
|
#define RIP 16*8
|
||||||
@ -239,14 +239,14 @@ After this we check `paranoid` and if it is set we check first three `CPL` bits.
|
|||||||
.endif
|
.endif
|
||||||
```
|
```
|
||||||
|
|
||||||
If we came from userspace we jump on the label `1` which starts from the `call error_entry` instruction. The `error_entry` saves all registers in the `pt_regs` structure which presetens an interrupt/exception stack frame and defined in the [arch/x86/include/uapi/asm/ptrace.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/ptrace.h). It saves common and extra registers on the stack with the:
|
If we came from userspace we jump on the label `1` which starts from the `call error_entry` instruction. The `error_entry` saves all registers in the `pt_regs` structure which presents an interrupt/exception stack frame and defined in the [arch/x86/include/uapi/asm/ptrace.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/ptrace.h). It saves common and extra registers on the stack with the:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
SAVE_C_REGS 8
|
SAVE_C_REGS 8
|
||||||
SAVE_EXTRA_REGS 8
|
SAVE_EXTRA_REGS 8
|
||||||
```
|
```
|
||||||
|
|
||||||
from `rdi` to `r15` and executes [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html) instruction. This instruction provides a method to for the Linux kernel to obtain a pointer to the kernel data structures and save the user's `gsbase`. After this we will exit from the `error_entry` with the `ret` instruction. After the `error_entry` finished to execute, since we came from userspace we need to switch on kernel interrupt stack:
|
from `rdi` to `r15` and executes [swapgs](http://www.felixcloutier.com/x86/SWAPGS.html) instruction. This instruction provides a method for the Linux kernel to obtain a pointer to the kernel data structures and save the user's `gsbase`. After this we will exit from the `error_entry` with the `ret` instruction. After the `error_entry` finished to execute, since we came from userspace we need to switch on kernel interrupt stack:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
movq %rsp,%rdi
|
movq %rsp,%rdi
|
||||||
@ -277,7 +277,7 @@ and put pointer of the `pt_regs` again in the `rdi`, and in the last step we cal
|
|||||||
call \do_sym
|
call \do_sym
|
||||||
```
|
```
|
||||||
|
|
||||||
So, realy exceptions handlers are `do_debug` and `do_int3` functions. We will see these function in this part, but little later. First of all let's look on the preparations before a processor will transfer control to an interrupt handler. In another way if `paranoid` is set, but it is not 1, we call `paranoid_entry` which makes almost the same that `error_entry`, but it checks current mode with more slow but accurate way:
|
So, real exceptions handlers are `do_debug` and `do_int3` functions. We will see these function in this part, but little later. First of all let's look on the preparations before a processor will transfer control to an interrupt handler. In another way if `paranoid` is set, but it is not 1, we call `paranoid_entry` which makes almost the same that `error_entry`, but it checks current mode with more slow but accurate way:
|
||||||
|
|
||||||
```assembly
|
```assembly
|
||||||
ENTRY(paranoid_entry)
|
ENTRY(paranoid_entry)
|
||||||
@ -434,9 +434,9 @@ exit:
|
|||||||
ist_exit(regs, prev_state);
|
ist_exit(regs, prev_state);
|
||||||
```
|
```
|
||||||
|
|
||||||
In the end we disabled `irqs`, decrement value of the `debug_stack_usage` and exit from the exception handler with the `ist_exit` function.
|
In the end we disable `irqs`, decrease value of the `debug_stack_usage` and exit from the exception handler with the `ist_exit` function.
|
||||||
|
|
||||||
The second exception handler is `do_int3` defined in the same source code file - [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). In the `do_int3` we makes almost the same that in the `do_debug` handler. We get the previous state with the `ist_enter`, increment and decrement the `debug_stack_usage` per-cpu variable, enabled and disable local interrupts. But of course there is one difference between these two handlers. We need to lock and than sync processor cores during breakpoint patching.
|
The second exception handler is `do_int3` defined in the same source code file - [arch/x86/kernel/traps.c](https://github.com/torvalds/linux/tree/master/arch/x86/kernel/traps.c). In the `do_int3` we make almost the same that in the `do_debug` handler. We get the previous state with the `ist_enter`, increase and decrease the `debug_stack_usage` per-cpu variable, enable and disable local interrupts. But of course there is one difference between these two handlers. We need to lock and then sync processor cores during breakpoint patching.
|
||||||
|
|
||||||
That's all.
|
That's all.
|
||||||
|
|
||||||
|
Loading…
Reference in New Issue
Block a user