commit
aad2a72f77
@ -0,0 +1,7 @@
|
||||
# Internal `system` structures of the Linux kernel
|
||||
|
||||
This is not a usual chapter of `linux-insides`. As you may understand from the title, it mostly describes
internal `system` structures of the Linux kernel, such as the `Interrupt Descriptor Table`, the `Global Descriptor
Table` and many more.
|
||||
|
||||
Most of the information is taken from the official [Intel](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html) and [AMD](http://developer.amd.com/resources/developer-guides-manuals/) manuals.
|
@ -0,0 +1,190 @@
|
||||
interrupt-descriptor table (IDT)
|
||||
================================================================================
|
||||
|
||||
There are three general sources of interrupts and exceptions:
|
||||
|
||||
* Exceptions - synchronous;
* Software interrupts - synchronous;
* External interrupts - asynchronous.
|
||||
|
||||
Types of Exceptions:
|
||||
|
||||
* Faults - are precise exceptions reported on the boundary `before` the instruction causing the exception. The saved `%rip` points to the faulting instruction;
|
||||
* Traps - are precise exceptions reported on the boundary `following` the instruction causing the exception. The saved `%rip` points to the instruction after the one that caused the trap;
|
||||
* Aborts - are imprecise exceptions. Because they are imprecise, aborts typically do not allow reliable program restart.
|
||||
|
||||
`Maskable` interrupts trigger the interrupt-handling mechanism only when RFLAGS.IF=1. Otherwise they are held pending for as long as the RFLAGS.IF bit is cleared to 0.
|
||||
|
||||
`Nonmaskable` interrupts (NMI) are unaffected by the value of the RFLAGS.IF bit. However, the occurrence of an NMI masks further NMIs until an IRET instruction is executed.
|
||||
|
||||
Specific exception and interrupt sources are assigned a fixed vector-identification number (also called an “interrupt vector” or simply “vector”). The interrupt vector is used by the interrupt-handling mechanism to locate the system-software service routine assigned to the exception or interrupt. Up to
|
||||
256 unique interrupt vectors are available. The first 32 vectors are reserved for predefined exception and interrupt conditions. They are defined in the [arch/x86/include/asm/traps.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/traps.h#L121) header file:
|
||||
|
||||
```
|
||||
/* Interrupts/Exceptions */
|
||||
enum {
|
||||
X86_TRAP_DE = 0, /* 0, Divide-by-zero */
|
||||
X86_TRAP_DB, /* 1, Debug */
|
||||
X86_TRAP_NMI, /* 2, Non-maskable Interrupt */
|
||||
X86_TRAP_BP, /* 3, Breakpoint */
|
||||
X86_TRAP_OF, /* 4, Overflow */
|
||||
X86_TRAP_BR, /* 5, Bound Range Exceeded */
|
||||
X86_TRAP_UD, /* 6, Invalid Opcode */
|
||||
X86_TRAP_NM, /* 7, Device Not Available */
|
||||
X86_TRAP_DF, /* 8, Double Fault */
|
||||
X86_TRAP_OLD_MF, /* 9, Coprocessor Segment Overrun */
|
||||
X86_TRAP_TS, /* 10, Invalid TSS */
|
||||
X86_TRAP_NP, /* 11, Segment Not Present */
|
||||
X86_TRAP_SS, /* 12, Stack Segment Fault */
|
||||
X86_TRAP_GP, /* 13, General Protection Fault */
|
||||
X86_TRAP_PF, /* 14, Page Fault */
|
||||
X86_TRAP_SPURIOUS, /* 15, Spurious Interrupt */
|
||||
X86_TRAP_MF, /* 16, x87 Floating-Point Exception */
|
||||
X86_TRAP_AC, /* 17, Alignment Check */
|
||||
X86_TRAP_MC, /* 18, Machine Check */
|
||||
X86_TRAP_XF, /* 19, SIMD Floating-Point Exception */
|
||||
X86_TRAP_IRET = 32, /* 32, IRET Exception */
|
||||
};
|
||||
```
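
To connect these vector numbers with the fault/trap/abort classes listed earlier, here is a small user-space sketch. The `exc_class` enum and the `x86_vector_class` helper are hypothetical names invented for this example (they are not kernel APIs), and the classification reflects my reading of the Intel and AMD manuals for vectors 0 through 19:

```C
/* Classes of events, following the "Types of Exceptions" list above. */
enum exc_class {
    EXC_FAULT,          /* saved %rip points to the faulting instruction */
    EXC_TRAP,           /* saved %rip points to the next instruction */
    EXC_ABORT,          /* imprecise, usually no reliable restart */
    EXC_FAULT_OR_TRAP,  /* depends on the condition that raised it */
    EXC_INTERRUPT,      /* not an exception at all (NMI) */
    EXC_UNKNOWN         /* reserved or not covered by this sketch */
};

/* Hypothetical helper: classify one of the predefined vectors. */
static enum exc_class x86_vector_class(unsigned vector)
{
    switch (vector) {
    case 1:                      /* #DB: fault or trap, depending on cause */
        return EXC_FAULT_OR_TRAP;
    case 2:                      /* NMI */
        return EXC_INTERRUPT;
    case 3: case 4:              /* #BP, #OF */
        return EXC_TRAP;
    case 8: case 18:             /* #DF, #MC */
        return EXC_ABORT;
    case 0: case 5: case 6: case 7:
    case 10: case 11: case 12: case 13: case 14:
    case 16: case 17: case 19:   /* #DE, #BR, #UD, #NM, #TS, #NP, #SS, #GP, #PF, #MF, #AC, #XF */
        return EXC_FAULT;
    default:                     /* 9, 15 and anything above 19 */
        return EXC_UNKNOWN;
    }
}
```

For example, `x86_vector_class(14)` classifies the page fault as a fault, which is why a page-fault handler can fix the mapping and restart the same instruction.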
|
||||
|
||||
Error Codes
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The processor exception-handling mechanism reports error and status information for some exceptions using an error code. The error code is pushed onto the stack by the exception-mechanism during the control transfer into the exception handler. The error code has two formats:
|
||||
|
||||
* the selector format, used by most error-reporting exceptions;
|
||||
* page fault format.
|
||||
|
||||
Here is the format of the selector error code:
|
||||
|
||||
```
|
||||
 31                        16 15                         3  2   1   0
+---------------------------------------------------------------------+
|                            |                            | T | I | E |
|          Reserved          |       Selector Index       | - | D | X |
|                            |                            | I | T | T |
+---------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
Where:
|
||||
|
||||
* `EXT` - If this bit is set to 1, the exception source is external to the processor. If cleared to 0, the exception source is internal to the processor;
|
||||
* `IDT` - If this bit is set to 1, the error-code selector-index field references a gate descriptor located in the `interrupt-descriptor table`. If cleared to 0, the selector-index field references a descriptor in either the `global-descriptor table` or local-descriptor table `LDT`, as indicated by the `TI` bit;
|
||||
* `TI` - If this bit is set to 1, the error-code selector-index field references a descriptor in the `LDT`. If cleared to 0, the selector-index field references a descriptor in the `GDT`.
|
||||
* `Selector Index` - The selector-index field specifies the index into either the `GDT`, `LDT`, or `IDT`, as specified by the `IDT` and `TI` bits.
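
The field layout above can be sketched as a small decoder. This is only an illustration in user-space C; the `sel_error_code` structure and the `decode_sel_error_code` function are hypothetical names and do not exist in the kernel:

```C
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of a selector error code (illustrative type). */
struct sel_error_code {
    bool ext;        /* bit 0: exception source is external to the processor */
    bool idt;        /* bit 1: the index references a gate in the IDT */
    bool ti;         /* bit 2: the index references the LDT (meaningful when idt is 0) */
    uint16_t index;  /* bits 15..3: selector index */
};

/* Split the raw 32-bit error code into the fields from the diagram. */
static struct sel_error_code decode_sel_error_code(uint32_t code)
{
    struct sel_error_code d;

    d.ext   = (code >> 0) & 0x1;
    d.idt   = (code >> 1) & 0x1;
    d.ti    = (code >> 2) & 0x1;
    d.index = (code >> 3) & 0x1fff;  /* 13 bits */
    return d;
}
```

For instance, the raw value `0x6a` decodes to index 13 with the `IDT` bit set, which means the error refers to the gate for vector 13 in the `IDT`.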
|
||||
|
||||
The page-fault error code format is:
|
||||
|
||||
```
|
||||
 31                                         4    3   2   1   0
+--------------------------------------------------------------+
|                                        |     | R | U | R | - |
|                Reserved                | I/D | S | - | - | P |
|                                        |     | V | S | W | - |
+--------------------------------------------------------------+
|
||||
```
|
||||
|
||||
Where:
|
||||
|
||||
* `I/D` - If this bit is set to 1, it indicates that the access that caused the page fault was an instruction fetch;
|
||||
* `RSV` - If this bit is set to 1, the page fault is a result of the processor reading a 1 from a reserved field within a page-translation-table entry;
|
||||
* `U/S` - If this bit is cleared to 0, an access in supervisor mode (`CPL=0, 1, or 2`) caused the page fault. If this bit is set to 1, an access in user mode (CPL=3) caused the page fault;
|
||||
* `R/W` - If this bit is cleared to 0, the access that caused the page fault is a memory read. If this bit is set to 1, the memory access that caused the page fault was a write;
|
||||
* `P` - If this bit is cleared to 0, the page fault was caused by a not-present page. If this bit is set to 1, the page fault was caused by a page-protection violation.
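
The bits described above can be tested with simple masks. The following user-space sketch uses illustrative names (`PF_P`, `pf_is_write` and so on are assumptions made for this example, not the kernel's own definitions):

```C
#include <stdint.h>
#include <stdio.h>

/* Bit masks for the page-fault error code (illustrative names). */
#define PF_P    (1u << 0)  /* 0: page not present, 1: protection violation */
#define PF_RW   (1u << 1)  /* 0: read access,      1: write access */
#define PF_US   (1u << 2)  /* 0: supervisor mode,  1: user mode */
#define PF_RSV  (1u << 3)  /* 1: reserved bit set in a page-table entry */
#define PF_ID   (1u << 4)  /* 1: instruction fetch */

static int pf_is_protection_violation(uint32_t code) { return (code & PF_P) != 0; }
static int pf_is_write(uint32_t code)                { return (code & PF_RW) != 0; }
static int pf_is_user(uint32_t code)                 { return (code & PF_US) != 0; }
static int pf_is_reserved_bit(uint32_t code)         { return (code & PF_RSV) != 0; }
static int pf_is_instruction_fetch(uint32_t code)    { return (code & PF_ID) != 0; }

/* Print a one-line, human-readable description of the error code. */
static void pf_print(uint32_t code)
{
    printf("%s mode, %s, %s%s\n",
           pf_is_user(code) ? "user" : "supervisor",
           pf_is_instruction_fetch(code) ? "instruction fetch"
                                         : pf_is_write(code) ? "write" : "read",
           pf_is_protection_violation(code) ? "protection violation"
                                            : "page not present",
           pf_is_reserved_bit(code) ? ", reserved bit set" : "");
}
```

With this sketch, `pf_print(0x7)` prints `user mode, write, protection violation`: a user-mode write to a page that is present but not writable.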
|
||||
|
||||
Interrupt Control Transfers
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
The IDT may contain any of three kinds of gate descriptors:
|
||||
|
||||
* `Task Gate` - contains the segment selector for a TSS for an exception and/or interrupt handler task;
|
||||
* `Interrupt Gate` - contains segment selector and offset that the processor uses to transfer program execution to a handler procedure in an interrupt handler code segment;
|
||||
* `Trap Gate` - contains segment selector and offset that the processor uses to transfer program execution to a handler procedure in an exception handler code segment.
|
||||
|
||||
The general format of a gate descriptor is:
|
||||
|
||||
```
|
||||
 127                                                               96
+--------------------------------------------------------------------+
|                              Reserved                              |
+--------------------------------------------------------------------+
 95                                                                64
+--------------------------------------------------------------------+
|                           Offset 63..32                            |
+--------------------------------------------------------------------+
 63                     48 47  46 45 44  43  40 39   37 36  35  34 32
+--------------------------------------------------------------------+
|      Offset 31..16      | P | DPL | 0 | Type | 0 0 0 | 0 | 0 | IST |
+--------------------------------------------------------------------+
 31                              16 15                              0
+--------------------------------------------------------------------+
|         Segment Selector         |          Offset 15..0           |
+--------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
Where:
|
||||
|
||||
* `Selector` - Segment Selector for destination code segment;
|
||||
* `Offset` - Offset to handler procedure entry point;
|
||||
* `DPL` - Descriptor Privilege Level;
|
||||
* `P` - Segment Present flag;
|
||||
* `IST` - Interrupt Stack Table;
|
||||
* `TYPE` - one of: Local descriptor-table (LDT) segment descriptor, Task-state segment (TSS) descriptor, Call-gate descriptor, Interrupt-gate descriptor, Trap-gate descriptor or Task-gate descriptor.
|
||||
|
||||
An `IDT` descriptor is represented by the following structure in the Linux kernel (only for `x86_64`):
|
||||
|
||||
```C
|
||||
struct gate_struct64 {
|
||||
u16 offset_low;
|
||||
u16 segment;
|
||||
unsigned ist : 3, zero0 : 5, type : 5, dpl : 2, p : 1;
|
||||
u16 offset_middle;
|
||||
u32 offset_high;
|
||||
u32 zero1;
|
||||
} __attribute__((packed));
|
||||
```
|
||||
|
||||
which is defined in the [arch/x86/include/asm/desc_defs.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/desc_defs.h#L51) header file.
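
To show how the handler address is split across the three offset fields, here is a user-space sketch. It is an assumption-laden illustration, not kernel code: `struct gate64`, `pack_gate` and `gate_offset` are hypothetical names, and the bit-fields of `gate_struct64` are replaced with a single 16-bit `flags` word so that the layout does not depend on compiler-specific bit-field ordering:

```C
#include <stdint.h>

/* Gate types used in long mode. */
#define GATE_TYPE_INTERRUPT 0xE
#define GATE_TYPE_TRAP      0xF

/* Same 16-byte layout as gate_struct64, without bit-fields. */
struct gate64 {
    uint16_t offset_low;     /* handler address bits 15..0 */
    uint16_t segment;        /* code segment selector */
    uint16_t flags;          /* ist:3, zero:5, type:5, dpl:2, p:1 */
    uint16_t offset_middle;  /* handler address bits 31..16 */
    uint32_t offset_high;    /* handler address bits 63..32 */
    uint32_t reserved;
};

_Static_assert(sizeof(struct gate64) == 16, "a long-mode gate is 16 bytes");

/* Build a present gate for the given handler address. */
static struct gate64 pack_gate(uint64_t handler, uint16_t segment,
                               unsigned type, unsigned dpl, unsigned ist)
{
    struct gate64 g;

    g.offset_low    = (uint16_t)(handler & 0xffff);
    g.segment       = segment;
    g.flags         = (uint16_t)((ist & 0x7) | ((type & 0x1f) << 8) |
                                 ((dpl & 0x3) << 13) | (1u << 15));
    g.offset_middle = (uint16_t)((handler >> 16) & 0xffff);
    g.offset_high   = (uint32_t)(handler >> 32);
    g.reserved      = 0;
    return g;
}

/* Reassemble the 64-bit handler address from the three offset fields. */
static uint64_t gate_offset(const struct gate64 *g)
{
    return (uint64_t)g->offset_low |
           ((uint64_t)g->offset_middle << 16) |
           ((uint64_t)g->offset_high << 32);
}
```

Packing an interrupt gate with `DPL` 0 and no `IST` entry gives a `flags` word of `0x8E00`, and `gate_offset` returns the original handler address.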
|
||||
|
||||
A task gate descriptor does not contain the `IST` field, and its format differs from that of interrupt/trap gates:
|
||||
|
||||
```C
|
||||
struct ldttss_desc64 {
|
||||
u16 limit0;
|
||||
u16 base0;
|
||||
unsigned base1 : 8, type : 5, dpl : 2, p : 1;
|
||||
unsigned limit1 : 4, zero0 : 3, g : 1, base2 : 8;
|
||||
u32 base3;
|
||||
u32 zero1;
|
||||
} __attribute__((packed));
|
||||
```
|
||||
|
||||
Exceptions During a Task Switch
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
An exception can occur during a task switch while loading a segment selector. Page faults can also occur when accessing a TSS. In these cases, the hardware task-switch mechanism completes loading the new task state from the TSS, and then triggers the appropriate exception mechanism.
|
||||
|
||||
**In long mode, an exception cannot occur during a task switch, because the hardware task-switch mechanism is disabled.**
|
||||
|
||||
Nonmaskable interrupt
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
**TODO**
|
||||
|
||||
API
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
**TODO**
|
||||
|
||||
Interrupt Stack Table
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
**TODO**
|
@ -0,0 +1,352 @@
|
||||
Synchronization primitives in the Linux kernel. Part 6.
|
||||
================================================================================
|
||||
|
||||
Introduction
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the sixth part of the chapter which describes [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_(computer_science)) in the Linux kernel. In the previous parts we finished considering different [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) synchronization primitives. In this part we will continue to learn about synchronization primitives and consider a similar one which can be used to avoid the `writer starvation` problem. This synchronization primitive is called `seqlock`, or `sequential lock`.
|
||||
|
||||
We know from the previous [part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-5.html) that a [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) is a special lock mechanism which allows concurrent access for read-only operations, while an exclusive lock is needed for writing or modifying data. As we may guess, this may lead to a problem called `writer starvation`. In other words, a writer process can't acquire the lock as long as at least one reader process holds it. So, when contention is high, a writer process which wants to acquire the lock may wait for it for a long time.
|
||||
|
||||
The `seqlock` synchronization primitive can help solve this problem.
|
||||
|
||||
As in all previous parts of this [book](https://0xax.gitbooks.io/linux-insides/content), we will try to consider this synchronization primitive from the theoretical side and only then consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) provided by the Linux kernel to manipulate `seqlocks`.
|
||||
|
||||
So, let's start.
|
||||
|
||||
Sequential lock
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
So, what is a `seqlock` synchronization primitive and how does it work? Let's try to answer these questions in this section. `Sequential locks` were introduced in the Linux kernel 2.6.x. The main point of this synchronization primitive is to provide fast and lock-free access to shared resources. Since the heart of the `sequential lock` synchronization primitive is the [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) synchronization primitive, `sequential locks` work best in situations where the protected resources are small and simple. Additionally, write access must be rare and should be fast.
|
||||
|
||||
This synchronization primitive is based on a sequence counter. A `sequential lock` allows readers free access to a resource, but each reader must check for conflicts with a writer. The main algorithm of `sequential locks` is simple: each writer which acquires a sequential lock acquires a [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) and increments this counter. When the writer finishes, it increments the counter of the sequential lock again and releases the acquired spinlock to give access to other writers.
|
||||
|
||||
Read-only access works on the following principle: a reader gets the value of the `sequential lock` counter before it enters the [critical section](https://en.wikipedia.org/wiki/Critical_section) and compares it with the value of the same counter at the exit of the critical section. If the values are equal, there were no writers during this period. If they are not equal, a writer has incremented the counter during the [critical section](https://en.wikipedia.org/wiki/Critical_section). This conflict means that reading of the protected data must be repeated.
|
||||
|
||||
That's all. As we may see, the principle behind `sequential locks` is simple.
|
||||
|
||||
```C
|
||||
unsigned int seq_counter_value;
|
||||
|
||||
do {
|
||||
seq_counter_value = get_seq_counter_val(&the_lock);
|
||||
//
|
||||
// do as we want here
|
||||
//
|
||||
} while (__retry__);
|
||||
```
|
||||
|
||||
Actually the Linux kernel does not provide a `get_seq_counter_val()` function. Here it is just a stub, as is `__retry__`. As I already wrote above, we will see the actual [API](https://en.wikipedia.org/wiki/Application_programming_interface) for this in the next section of this part.
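
To make the read side concrete outside the kernel, here is a minimal user-space sketch of the same idea built on C11 atomics. It is not the kernel implementation: `struct seq_data`, `seq_write` and `seq_read` are hypothetical names, the sketch assumes a single writer (so it needs no spinlock), and the protected 64-bit value is stored as two relaxed atomic halves so that concurrent access stays well defined:

```C
#include <stdatomic.h>
#include <stdint.h>

/* A 64-bit value protected by a sequence counter (illustrative type). */
struct seq_data {
    atomic_uint seq;       /* even: stable, odd: write in progress */
    _Atomic uint32_t lo;   /* low half of the protected value */
    _Atomic uint32_t hi;   /* high half of the protected value */
};

/* Writer side; assumes there is only one writer at a time. */
static void seq_write(struct seq_data *d, uint64_t v)
{
    unsigned s = atomic_load_explicit(&d->seq, memory_order_relaxed);

    atomic_store_explicit(&d->seq, s + 1, memory_order_relaxed);  /* now odd */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&d->lo, (uint32_t)v, memory_order_relaxed);
    atomic_store_explicit(&d->hi, (uint32_t)(v >> 32), memory_order_relaxed);
    atomic_store_explicit(&d->seq, s + 2, memory_order_release);  /* even again */
}

/* Reader side: retry while a write was in progress or happened meanwhile. */
static uint64_t seq_read(struct seq_data *d)
{
    unsigned start;
    uint32_t lo, hi;

    do {
        start = atomic_load_explicit(&d->seq, memory_order_acquire);
        lo = atomic_load_explicit(&d->lo, memory_order_relaxed);
        hi = atomic_load_explicit(&d->hi, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
    } while ((start & 1) ||
             atomic_load_explicit(&d->seq, memory_order_relaxed) != start);

    return ((uint64_t)hi << 32) | lo;
}
```

The reader never writes to shared memory, so readers cannot delay the writer. The price is that a reader may have to repeat its work, which is why the protected data should be small and writes should be rare.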
|
||||
|
||||
Ok, now we know what a `seqlock` synchronization primitive is and how it works. So we may go ahead and start to look at the [API](https://en.wikipedia.org/wiki/Application_programming_interface) which the Linux kernel provides for manipulating synchronization primitives of this type.
|
||||
|
||||
Sequential lock API
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
So, now that we know a little about the `sequential lock` synchronization primitive from the theoretical side, let's look at its implementation in the Linux kernel. The whole `sequential lock` [API](https://en.wikipedia.org/wiki/Application_programming_interface) is located in the [include/linux/seqlock.h](https://github.com/torvalds/linux/blob/master/include/linux/seqlock.h) header file.
|
||||
|
||||
First of all we may see that the `sequential lock` mechanism is represented by the following type:
|
||||
|
||||
```C
|
||||
typedef struct {
|
||||
struct seqcount seqcount;
|
||||
spinlock_t lock;
|
||||
} seqlock_t;
|
||||
```
|
||||
|
||||
As we may see, the `seqlock_t` provides two fields. These fields represent the sequential lock counter, which we described above, and a [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) which protects data from other writers. Note that the `seqcount` counter is represented by the `seqcount` type. The `seqcount` is a structure:
|
||||
|
||||
```C
|
||||
typedef struct seqcount {
|
||||
unsigned sequence;
|
||||
#ifdef CONFIG_DEBUG_LOCK_ALLOC
|
||||
struct lockdep_map dep_map;
|
||||
#endif
|
||||
} seqcount_t;
|
||||
```
|
||||
|
||||
which holds the counter of a sequential lock and a [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related field.
|
||||
|
||||
As always in previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/), before we consider the [API](https://en.wikipedia.org/wiki/Application_programming_interface) of the `sequential lock` mechanism in the Linux kernel, we need to know how to initialize an instance of `seqlock_t`.
|
||||
|
||||
We saw in the previous parts that the Linux kernel often provides two approaches to initializing a given synchronization primitive. The same is true for the `seqlock_t` structure. A `seqlock_t` can be initialized in one of two ways:

* `statically`;
* `dynamically`.

Let's look at the first approach. We are able to initialize a `seqlock_t` statically with the `DEFINE_SEQLOCK` macro:
|
||||
|
||||
```C
|
||||
#define DEFINE_SEQLOCK(x) \
|
||||
seqlock_t x = __SEQLOCK_UNLOCKED(x)
|
||||
```
|
||||
|
||||
which is defined in the [include/linux/seqlock.h](https://github.com/torvalds/linux/blob/master/include/linux/seqlock.h) header file. As we may see, the `DEFINE_SEQLOCK` macro takes one argument and expands to the definition and initialization of the `seqlock_t` structure. Initialization occurs with the help of the `__SEQLOCK_UNLOCKED` macro which is defined in the same source code file. Let's look at the implementation of this macro:
|
||||
|
||||
```C
|
||||
#define __SEQLOCK_UNLOCKED(lockname) \
|
||||
{ \
|
||||
.seqcount = SEQCNT_ZERO(lockname), \
|
||||
.lock = __SPIN_LOCK_UNLOCKED(lockname) \
|
||||
}
|
||||
```
|
||||
|
||||
As we may see, the `__SEQLOCK_UNLOCKED` macro initializes the fields of the given `seqlock_t` structure. The first field, `seqcount`, is initialized with the `SEQCNT_ZERO` macro, which is defined as:
|
||||
|
||||
```C
|
||||
#define SEQCNT_ZERO(lockname) { .sequence = 0, SEQCOUNT_DEP_MAP_INIT(lockname)}
|
||||
```
|
||||
|
||||
So we just initialize the counter of the given sequential lock to zero. Additionally we can see [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related initialization which depends on the state of the `CONFIG_DEBUG_LOCK_ALLOC` kernel configuration option:
|
||||
|
||||
```C
|
||||
#ifdef CONFIG_DEBUG_LOCK_ALLOC
|
||||
# define SEQCOUNT_DEP_MAP_INIT(lockname) \
|
||||
.dep_map = { .name = #lockname } \
|
||||
...
|
||||
...
|
||||
...
|
||||
#else
|
||||
# define SEQCOUNT_DEP_MAP_INIT(lockname)
|
||||
...
|
||||
...
|
||||
...
|
||||
#endif
|
||||
```
|
||||
|
||||
As I already wrote in previous parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/), we will not consider [debugging](https://en.wikipedia.org/wiki/Debugging) and [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) related stuff in this part. So for now we just skip the `SEQCOUNT_DEP_MAP_INIT` macro. The second field of the given `seqlock_t` is `lock`, initialized with the `__SPIN_LOCK_UNLOCKED` macro which is defined in the [include/linux/spinlock_types.h](https://github.com/torvalds/linux/blob/master/include/linux/spinlock_types.h) header file. We will not consider the implementation of this macro here, as it just initializes a [rawspinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) with architecture-specific methods (you may read more about spinlocks in the first parts of this [chapter](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/)).
|
||||
|
||||
We have considered the first way to initialize a sequential lock. Let's consider the second way to do the same, but dynamically. We can initialize a sequential lock with the `seqlock_init` macro which is defined in the same [include/linux/seqlock.h](https://github.com/torvalds/linux/blob/master/include/linux/seqlock.h) header file.
|
||||
|
||||
Let's look at the implementation of this macro:
|
||||
|
||||
```C
|
||||
#define seqlock_init(x) \
|
||||
do { \
|
||||
seqcount_init(&(x)->seqcount); \
|
||||
spin_lock_init(&(x)->lock); \
|
||||
} while (0)
|
||||
```
|
||||
|
||||
As we may see, the `seqlock_init` expands into two macros. The first macro, `seqcount_init`, takes the counter of the given sequential lock and expands to a call of the `__seqcount_init` function:
|
||||
|
||||
```C
|
||||
# define seqcount_init(s) \
|
||||
do { \
|
||||
static struct lock_class_key __key; \
|
||||
__seqcount_init((s), #s, &__key); \
|
||||
} while (0)
|
||||
```
|
||||
|
||||
from the same header file. This function
|
||||
|
||||
```C
|
||||
static inline void __seqcount_init(seqcount_t *s, const char *name,
|
||||
struct lock_class_key *key)
|
||||
{
|
||||
lockdep_init_map(&s->dep_map, name, key, 0);
|
||||
s->sequence = 0;
|
||||
}
|
||||
```
|
||||
|
||||
just initializes the [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt) map and sets the counter of the given `seqcount_t` to zero. The second call from the `seqlock_init` macro is the call of the `spin_lock_init` macro which we saw in the [first part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html) of this chapter.
|
||||
|
||||
So, now we know how to initialize a `sequential lock`. Let's look at how to use it. The Linux kernel provides the following [API](https://en.wikipedia.org/wiki/Application_programming_interface) to manipulate `sequential locks`:
|
||||
|
||||
```C
|
||||
static inline unsigned read_seqbegin(const seqlock_t *sl);
|
||||
static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start);
|
||||
static inline void write_seqlock(seqlock_t *sl);
|
||||
static inline void write_sequnlock(seqlock_t *sl);
|
||||
static inline void write_seqlock_irq(seqlock_t *sl);
|
||||
static inline void write_sequnlock_irq(seqlock_t *sl);
|
||||
static inline void read_seqlock_excl(seqlock_t *sl);
static inline void read_sequnlock_excl(seqlock_t *sl);
|
||||
```
|
||||
|
||||
and others. Before we move on to considering the implementation of this [API](https://en.wikipedia.org/wiki/Application_programming_interface), we must know that there are actually two types of readers. The first type of reader never blocks a writer process, so a writer will not wait for readers. The second type of reader takes a lock. In this case, the locking reader blocks the writer, which has to wait until the reader releases its lock.
|
||||
|
||||
First of all let's consider the first type of readers. The `read_seqbegin` function begins a seq-read [critical section](https://en.wikipedia.org/wiki/Critical_section).
|
||||
|
||||
As we may see, this function just returns the value of the `read_seqcount_begin` function:
|
||||
|
||||
```C
|
||||
static inline unsigned read_seqbegin(const seqlock_t *sl)
|
||||
{
|
||||
return read_seqcount_begin(&sl->seqcount);
|
||||
}
|
||||
```
|
||||
|
||||
In turn, the `read_seqcount_begin` function calls the `raw_read_seqcount_begin` function:
|
||||
|
||||
```C
|
||||
static inline unsigned read_seqcount_begin(const seqcount_t *s)
|
||||
{
|
||||
return raw_read_seqcount_begin(s);
|
||||
}
|
||||
```
|
||||
|
||||
which reads the value of the `sequential lock` counter and issues a read memory barrier before returning it:
|
||||
|
||||
```C
|
||||
static inline unsigned raw_read_seqcount(const seqcount_t *s)
|
||||
{
|
||||
unsigned ret = READ_ONCE(s->sequence);
|
||||
smp_rmb();
|
||||
return ret;
|
||||
}
|
||||
```
|
||||
|
||||
After we have the initial value of the given `sequential lock` counter and have done some work, we know from the previous section that we need to compare it with the current value of the counter of the same `sequential lock` before we exit from the critical section. We can achieve this by calling the `read_seqretry` function. This function takes a `sequential lock` and the start value of the counter and, through a chain of functions:
|
||||
|
||||
```C
|
||||
static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start)
|
||||
{
|
||||
return read_seqcount_retry(&sl->seqcount, start);
|
||||
}
|
||||
|
||||
static inline int read_seqcount_retry(const seqcount_t *s, unsigned start)
|
||||
{
|
||||
smp_rmb();
|
||||
return __read_seqcount_retry(s, start);
|
||||
}
|
||||
```
|
||||
|
||||
it calls the `__read_seqcount_retry` function:
|
||||
|
||||
```C
|
||||
static inline int __read_seqcount_retry(const seqcount_t *s, unsigned start)
|
||||
{
|
||||
return unlikely(s->sequence != start);
|
||||
}
|
||||
```
|
||||
|
||||
which just compares the value of the counter of the given `sequential lock` with the initial value of this counter. If the initial value of the counter obtained from the `read_seqbegin()` function is odd, this means that a writer was in the middle of updating the data when our reader began to act. In this case the data can be in an inconsistent state, so we need to try to read it again.
|
||||
|
||||
This is a common pattern in the Linux kernel. For example, you may remember the `jiffies` concept from the [first part](https://0xax.gitbooks.io/linux-insides/content/Timers/timers-1.html) of the [timers and time management in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/) chapter. A sequential lock is used to obtain the value of `jiffies` on the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture:
|
||||
|
||||
```C
|
||||
u64 get_jiffies_64(void)
|
||||
{
|
||||
unsigned long seq;
|
||||
u64 ret;
|
||||
|
||||
do {
|
||||
seq = read_seqbegin(&jiffies_lock);
|
||||
ret = jiffies_64;
|
||||
} while (read_seqretry(&jiffies_lock, seq));
|
||||
return ret;
|
||||
}
|
||||
```
|
||||
|
||||
Here we just read the value of the counter of the `jiffies_lock` sequential lock and then copy the value of the `jiffies_64` system variable to `ret`. Since this is a `do/while` loop, the body of the loop is executed at least once. After the body of the loop has been executed, we read the current value of the counter of `jiffies_lock` and compare it with the initial value. If these values are not equal, the loop is repeated; otherwise `get_jiffies_64` returns the value stored in `ret`.
|
||||
|
||||
We just saw the first type of readers, which do not block writers or other readers. Let's consider the second type. It does not update the value of the `sequential lock` counter, but just acquires the `spinlock`:
|
||||
|
||||
```C
|
||||
static inline void read_seqlock_excl(seqlock_t *sl)
|
||||
{
|
||||
spin_lock(&sl->lock);
|
||||
}
|
||||
```
|
||||
|
||||
So, no other locking reader or writer can access the protected data. When a reader finishes, the lock must be unlocked with the:
|
||||
|
||||
```C
|
||||
static inline void read_sequnlock_excl(seqlock_t *sl)
|
||||
{
|
||||
spin_unlock(&sl->lock);
|
||||
}
|
||||
```
|
||||
|
||||
function.
|
||||
|
||||
Now we know how a `sequential lock` works for readers. Let's consider how a writer acts when it wants to acquire a `sequential lock` to modify data. To acquire a `sequential lock`, a writer should use the `write_seqlock` function. If we look at the implementation of this function:
|
||||
|
||||
```C
|
||||
static inline void write_seqlock(seqlock_t *sl)
|
||||
{
|
||||
spin_lock(&sl->lock);
|
||||
write_seqcount_begin(&sl->seqcount);
|
||||
}
|
||||
```
|
||||
|
||||
we will see that it acquires the `spinlock` to prevent access from other writers and calls the `write_seqcount_begin` function. This function, through `raw_write_seqcount_begin`, just increments the value of the `sequential lock` counter:
|
||||
|
||||
```C
|
||||
static inline void raw_write_seqcount_begin(seqcount_t *s)
|
||||
{
|
||||
s->sequence++;
|
||||
smp_wmb();
|
||||
}
|
||||
```
|
||||
|
||||
When a writer process finishes modifying data, the `write_sequnlock` function must be called to release the lock and give access to other writers or readers. Let's look at the implementation of the `write_sequnlock` function. It looks pretty simple:
|
||||
|
||||
```C
|
||||
static inline void write_sequnlock(seqlock_t *sl)
|
||||
{
|
||||
write_seqcount_end(&sl->seqcount);
|
||||
spin_unlock(&sl->lock);
|
||||
}
|
||||
```
|
||||
|
||||
First of all it calls the `write_seqcount_end` function which, through `raw_write_seqcount_end`, increments the value of the counter of the `sequential lock` again:
|
||||
|
||||
```C
|
||||
static inline void raw_write_seqcount_end(seqcount_t *s)
|
||||
{
|
||||
smp_wmb();
|
||||
s->sequence++;
|
||||
}
|
||||
```
|
||||
|
||||
and in the end we just call the `spin_unlock` macro to give access for other readers or writers.
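
The order of operations on the write side (lock, increment, modify, increment, unlock) can be reproduced in user space. The following is only a sketch under stated assumptions: `struct my_seqlock` and the `my_*` functions are hypothetical names, and a C11 `atomic_flag` spin loop stands in for the kernel's `spinlock_t`:

```C
#include <stdatomic.h>

/* User-space stand-in for seqlock_t (illustrative type). */
struct my_seqlock {
    atomic_uint sequence;  /* odd while a writer is inside */
    atomic_flag lock;      /* very simple spinlock, excludes other writers */
};

/* Acquire the lock first, then make the counter odd. */
static void my_write_seqlock(struct my_seqlock *sl)
{
    while (atomic_flag_test_and_set_explicit(&sl->lock, memory_order_acquire))
        ;  /* spin until the previous writer releases the lock */
    atomic_fetch_add_explicit(&sl->sequence, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
}

/* Make the counter even again, then release the lock. */
static void my_write_sequnlock(struct my_seqlock *sl)
{
    atomic_fetch_add_explicit(&sl->sequence, 1, memory_order_release);
    atomic_flag_clear_explicit(&sl->lock, memory_order_release);
}

/* Non-zero while some writer is between lock and unlock. */
static int my_write_in_progress(struct my_seqlock *sl)
{
    return atomic_load_explicit(&sl->sequence, memory_order_acquire) & 1;
}
```

Because the counter is incremented only while the lock is held, it is odd exactly during an update and advances by two for every completed write. A lockless reader relies on this to detect both an in-progress and a finished concurrent write.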
|
||||
|
||||
That's all about the `sequential lock` mechanism in the Linux kernel. Of course we did not consider the full [API](https://en.wikipedia.org/wiki/Application_programming_interface) of this mechanism in this part, but all other functions are based on those which we described here. For example, the Linux kernel also provides some safe macros/functions to use the `sequential lock` mechanism in [interrupt handlers](https://en.wikipedia.org/wiki/Interrupt_handler) or [softirq](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html): `write_seqlock_irq` and `write_sequnlock_irq`:
|
||||
|
||||
```C
|
||||
static inline void write_seqlock_irq(seqlock_t *sl)
|
||||
{
|
||||
spin_lock_irq(&sl->lock);
|
||||
write_seqcount_begin(&sl->seqcount);
|
||||
}
|
||||
|
||||
static inline void write_sequnlock_irq(seqlock_t *sl)
|
||||
{
|
||||
write_seqcount_end(&sl->seqcount);
|
||||
spin_unlock_irq(&sl->lock);
|
||||
}
|
||||
```
|
||||
|
||||
As we may see, these functions differ only in how the spinlock is acquired and released. They call `spin_lock_irq` and `spin_unlock_irq` instead of `spin_lock` and `spin_unlock`.
|
||||
|
||||
Or, for example, the `write_seqlock_irqsave` and `write_sequnlock_irqrestore` functions, which are the same but use the `spin_lock_irqsave` and `spin_unlock_irqrestore` macros for use in [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_(PC_architecture)) handlers.
|
||||
|
||||
That's all.
|
||||
|
||||
Conclusion
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
This is the end of the sixth part of the [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_%28computer_science%29) chapter in the Linux kernel. In this part we met a new synchronization primitive called the `sequential lock`. From the theoretical side, this synchronization primitive is very similar to a [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock) synchronization primitive, but it allows us to avoid the `writer starvation` issue.
|
||||
|
||||
If you have questions or suggestions, feel free to ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).
|
||||
|
||||
**Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**
|
||||
|
||||
Links
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
* [synchronization primitives](https://en.wikipedia.org/wiki/Synchronization_(computer_science))
|
||||
* [readers-writer lock](https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock)
|
||||
* [spinlock](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-1.html)
|
||||
* [critical section](https://en.wikipedia.org/wiki/Critical_section)
|
||||
* [lock validator](https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt)
|
||||
* [debugging](https://en.wikipedia.org/wiki/Debugging)
|
||||
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
|
||||
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
|
||||
* [Timers and time management in the Linux kernel](https://0xax.gitbooks.io/linux-insides/content/Timers/)
|
||||
* [interrupt handlers](https://en.wikipedia.org/wiki/Interrupt_handler)
|
||||
* [softirq](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html)
|
||||
* [IRQ](https://en.wikipedia.org/wiki/Interrupt_request_(PC_architecture))
|
||||
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/SyncPrim/sync-5.html)
|
@ -1,14 +1,14 @@
# Interrupts and Interrupt Handling

You will find a couple of posts which describe interrupts and exceptions handling in the linux kernel.
In the following posts, we will cover interrupts and exceptions handling in the linux kernel.

* [Interrupts and Interrupt Handling. Part 1.](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-1.md) - describes an interrupts handling theory.
* [Start to dive into interrupts in the Linux kernel](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-2.md) - this part starts to describe interrupts and exceptions handling related stuff from the early stage.
* [Early interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-3.md) - third part describes early interrupt handlers.
* [Interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-4.md) - fourth part describes first non-early interrupt handlers.
* [Implementation of exception handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-5.md) - descripbes implementation of some exception handlers as double fault, divide by zero and etc.
* [Handling Non-Maskable interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-6.md) - describes handling of non-maskable interrupts and the rest of interrupts handlers from the architecture-specific part.
* [Dive into external hardware interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-7.md) - this part describes early initialization of code which is related to handling of external hardware interrupts.
* [Non-early initialization of the IRQs](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-8.md) - this part describes non-early initialization of code which is related to handling of external hardware interrupts.
* [Softirq, Tasklets and Workqueues](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-9.md) - this part describes softirqs, tasklets and workqueues concepts.
* [](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-10.md) - this is the last part of the interrupts and interrupt handling chapter and here we will see a real hardware driver and interrupts related stuff.
* [Interrupts and Interrupt Handling. Part 1.](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-1.md) - describes interrupts and interrupt handling theory.
* [Interrupts in the Linux Kernel](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-2.md) - describes stuffs related to interrupts and exceptions handling from the early stage.
* [Early interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-3.md) - describes early interrupt handlers.
* [Interrupt handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-4.md) - describes first non-early interrupt handlers.
* [Implementation of exception handlers](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-5.md) - describes implementation of some exception handlers such as double fault, divide by zero etc.
* [Handling non-maskable interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-6.md) - describes handling of non-maskable interrupts and remaining interrupt handlers from the architecture-specific part.
* [External hardware interrupts](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-7.md) - describes early initialization of code which is related to handling external hardware interrupts.
* [Non-early initialization of the IRQs](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-8.md) - describes non-early initialization of code which is related to handling external hardware interrupts.
* [Softirq, Tasklets and Workqueues](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-9.md) - describes softirqs, tasklets and workqueues concepts.
* [](https://github.com/0xAX/linux-insides/blob/master/interrupts/interrupts-10.md) - this is the last part of the `Interrupts and Interrupt Handling` chapter and here we will see a real hardware driver and some interrupts related stuff.

@ -0,0 +1,434 @@
Linux kernel memory management Part 3.
================================================================================

Introduction to the kmemcheck in the Linux kernel
--------------------------------------------------------------------------------

This is the third part of the [chapter](https://0xax.gitbooks.io/linux-insides/content/mm/) which describes [memory management](https://en.wikipedia.org/wiki/Memory_management) in the Linux kernel. In the previous [part](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html) of this chapter we met two memory management related concepts:

* `Fix-Mapped Addresses`;
* `ioremap`.

The first concept represents a special area in [virtual memory](https://en.wikipedia.org/wiki/Virtual_memory) whose corresponding physical mapping is calculated at [compile-time](https://en.wikipedia.org/wiki/Compile_time). The second concept provides the ability to map input/output related memory to virtual memory.

For example, if you look at the output of `/proc/iomem`:

```
$ sudo cat /proc/iomem

00000000-00000fff : reserved
00001000-0009d7ff : System RAM
0009d800-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000cffff : Video ROM
000d0000-000d3fff : PCI Bus 0000:00
000d4000-000d7fff : PCI Bus 0000:00
000d8000-000dbfff : PCI Bus 0000:00
000dc000-000dffff : PCI Bus 0000:00
000e0000-000fffff : reserved
...
...
...
```

you will see a map of the system's memory for each physical device. Here the first column displays the address range used by each of the different types of memory, and the second column lists the kind of memory located within that range. Or, for example:

```
$ sudo cat /proc/ioports

0000-0cf7 : PCI Bus 0000:00
  0000-001f : dma1
  0020-0021 : pic1
  0040-0043 : timer0
  0050-0053 : timer1
  0060-0060 : keyboard
  0064-0064 : keyboard
  0070-0077 : rtc0
  0080-008f : dma page reg
  00a0-00a1 : pic2
  00c0-00df : dma2
  00f0-00ff : fpu
    00f0-00f0 : PNP0C04:00
  03c0-03df : vga+
  03f8-03ff : serial
  04d0-04d1 : pnp 00:06
  0800-087f : pnp 00:01
  0a00-0a0f : pnp 00:04
  0a20-0a2f : pnp 00:04
  0a30-0a3f : pnp 00:04
...
...
...
```

can show us a list of the currently registered port regions used for input or output communication with a device. The kernel does not use memory-mapped I/O addresses directly. So, before the Linux kernel can use such memory, it must map it into the virtual memory space, which is the main purpose of the `ioremap` mechanism. Note that we saw only the early `ioremap` in the previous [part](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html). Soon we will look at the implementation of the non-early `ioremap` function. But before this we must learn other things, like the different types of memory allocators, because otherwise it will be very difficult to understand.

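Both files shown above use the same `start-end : name` line layout, so a single line can be split into its parts with `sscanf`. The following is a minimal userspace sketch, not kernel code: `struct toy_region`, `toy_parse_region` and `toy_region_size` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for one line of /proc/iomem or /proc/ioports. */
struct toy_region {
    unsigned long long start;   /* first address of the range (inclusive) */
    unsigned long long end;     /* last address of the range (inclusive) */
    char name[64];              /* owner of the range, e.g. "System RAM" */
};

/*
 * Parse a "start-end : name" line. The leading blank in the format skips
 * the indentation of nested entries. Returns 0 on success and -1 if the
 * line does not match the expected layout.
 */
static int toy_parse_region(const char *line, struct toy_region *out)
{
    if (sscanf(line, " %llx-%llx : %63[^\n]",
               &out->start, &out->end, out->name) != 3)
        return -1;

    if (out->end < out->start)
        return -1;

    return 0;
}

/* Size of the range in bytes (or in ports); both ends are inclusive. */
static unsigned long long toy_region_size(const struct toy_region *r)
{
    assert(r->end >= r->start);
    return r->end - r->start + 1;
}
```

For the `00001000-0009d7ff : System RAM` line this gives a size of `0x9c800` bytes, and for `0020-0021 : pic1` a size of two ports.
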
So, before we move on to the non-early [memory management](https://en.wikipedia.org/wiki/Memory_management) of the Linux kernel, we will look at some mechanisms which provide special abilities for [debugging](https://en.wikipedia.org/wiki/Debugging), checking for [memory leaks](https://en.wikipedia.org/wiki/Memory_leak), memory control and so on. It will be easier to understand how memory management is arranged in the Linux kernel after learning all of these things.

As you may already guess from the title of this part, we will start to consider memory mechanisms from [kmemcheck](https://www.kernel.org/doc/Documentation/kmemcheck.txt). As we always did in other [chapters](https://0xax.gitbooks.io/linux-insides/content/), we will start from the theoretical side and learn what the `kmemcheck` mechanism is in general, and only after this will we see how it is implemented in the Linux kernel.

So let's start. What is `kmemcheck` in the Linux kernel? As you may guess from the name of this mechanism, `kmemcheck` checks memory. That's true. The main point of the `kmemcheck` mechanism is to detect kernel code that accesses `uninitialized memory`. Let's take the following simple [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) program:

```C
#include <stdlib.h>
#include <stdio.h>

struct A {
    int a;
};

int main(int argc, char **argv) {
    struct A *a = malloc(sizeof(struct A));
    printf("a->a = %d\n", a->a);
    return 0;
}
```

Here we allocate memory for the `A` structure and try to print the value of the `a` field. If we compile this program without additional options:

```
gcc test.c -o test
```

The [compiler](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) will not warn us that the `a` field is uninitialized. But if we run this program with the [valgrind](https://en.wikipedia.org/wiki/Valgrind) tool, we will see the following output:

```
~$ valgrind --leak-check=yes ./test
==28469== Memcheck, a memory error detector
==28469== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==28469== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==28469== Command: ./test
==28469==
==28469== Conditional jump or move depends on uninitialised value(s)
==28469==    at 0x4E820EA: vfprintf (in /usr/lib64/libc-2.22.so)
==28469==    by 0x4E88D48: printf (in /usr/lib64/libc-2.22.so)
==28469==    by 0x4005B9: main (in /home/alex/test)
==28469==
==28469== Use of uninitialised value of size 8
==28469==    at 0x4E7E0BB: _itoa_word (in /usr/lib64/libc-2.22.so)
==28469==    by 0x4E8262F: vfprintf (in /usr/lib64/libc-2.22.so)
==28469==    by 0x4E88D48: printf (in /usr/lib64/libc-2.22.so)
==28469==    by 0x4005B9: main (in /home/alex/test)
...
...
...
```

Actually the `kmemcheck` mechanism does for the kernel what `valgrind` does for userspace programs: it checks for uninitialized memory.

To enable this mechanism in the Linux kernel, you need to enable the `CONFIG_KMEMCHECK` kernel configuration option in the:

```
Kernel hacking
  -> Memory Debugging
```

menu of the Linux kernel configuration:

![kernel configuration menu](http://oi63.tinypic.com/2pzbog7.jpg)

Besides enabling support for the `kmemcheck` mechanism in the Linux kernel, we can also tune it with several configuration options. We will see all of these options in the next paragraph of this part. One last note before we consider how `kmemcheck` checks memory: currently this mechanism is implemented only for the [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture. You can verify this by looking at the [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig) `x86` related kernel configuration file, where you will see the following lines:

```
config X86
  ...
  ...
  ...
  select HAVE_ARCH_KMEMCHECK
  ...
  ...
  ...
```

So, there is nothing specific for other architectures.

Ok, so we know that `kmemcheck` provides a mechanism to check usage of `uninitialized memory` in the Linux kernel and how to enable it. How does it do these checks? When the Linux kernel allocates some memory, i.e. something like this is called:

```C
struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL);
```

or in other words when somebody wants to access a [page](https://en.wikipedia.org/wiki/Page_%28computer_memory%29), a [page fault](https://en.wikipedia.org/wiki/Page_fault) exception is generated. This is achieved by the fact that `kmemcheck` marks memory pages as `non-present` (you can read more about this in the special part devoted to [paging](https://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)). When a `page fault` exception occurs, the exception handler knows about it and, when `kmemcheck` is enabled, transfers control to it. After `kmemcheck` finishes its checks, the page is marked as `present` and the interrupted code is able to continue execution. There is a little subtlety in this chain: after the first instruction of the interrupted code is executed, `kmemcheck` marks the page as `non-present` again. In this way the next access to this memory will be caught again.

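The hide, fault, check, show, step, hide cycle described above can be modelled in plain userspace C. This is a toy simulation under stated assumptions, not how the kernel does it: there is no real MMU involved, every name prefixed with `toy_` is hypothetical, and a "page fault" is just a function call made when the `present` flag is clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 16

/* Hypothetical model of one tracked page. */
struct toy_page {
    bool present;                       /* false: any access "faults" */
    bool initialized[TOY_PAGE_SIZE];    /* shadow: one flag per data byte */
    unsigned char data[TOY_PAGE_SIZE];
    unsigned int errors;                /* reads of uninitialized bytes */
    unsigned int faults;                /* simulated page faults taken */
};

/* A freshly allocated page starts hidden and fully uninitialized. */
static void toy_page_alloc(struct toy_page *p)
{
    memset(p, 0, sizeof(*p));
    p->present = false;
}

/* Simulated page fault: check the access, then show the page. */
static void toy_fault(struct toy_page *p, size_t off, bool is_write)
{
    p->faults++;
    if (is_write)
        p->initialized[off] = true;
    else if (!p->initialized[off])
        p->errors++;
    p->present = true;
}

/* Simulated single-step trap after the access: hide the page again. */
static void toy_single_step(struct toy_page *p)
{
    p->present = false;
}

static void toy_write(struct toy_page *p, size_t off, unsigned char value)
{
    assert(off < TOY_PAGE_SIZE);
    if (!p->present)
        toy_fault(p, off, true);
    p->data[off] = value;
    toy_single_step(p);
}

static unsigned char toy_read(struct toy_page *p, size_t off)
{
    unsigned char value;

    assert(off < TOY_PAGE_SIZE);
    if (!p->present)
        toy_fault(p, off, false);
    value = p->data[off];
    toy_single_step(p);
    return value;
}
```

Reading a byte before writing it bumps `errors`, while a read that follows a write to the same offset does not. Every access takes exactly one simulated fault because the page is hidden again after each access.
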
We just considered the `kmemcheck` mechanism from the theoretical side. Now let's consider how it is implemented in the Linux kernel.

Implementation of the `kmemcheck` mechanism in the Linux kernel
--------------------------------------------------------------------------------

So, now we know what `kmemcheck` is and what it does in the Linux kernel. It is time to look at its implementation. The implementation of `kmemcheck` is split into two parts. The first, generic part is located in the [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c) source code file and the second, [x86_64](https://en.wikipedia.org/wiki/X86-64) architecture-specific part is located in the [arch/x86/mm/kmemcheck](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck) directory.

Let's start from the initialization of this mechanism. We already know that to enable the `kmemcheck` mechanism in the Linux kernel, we must enable the `CONFIG_KMEMCHECK` kernel configuration option. But besides this, we need to pass one of the following parameters:

* kmemcheck=0 (disabled)
* kmemcheck=1 (enabled)
* kmemcheck=2 (one-shot mode)

to the Linux kernel command line. The first two are clear, but the last needs a little explanation. This option switches `kmemcheck` into a special mode in which it is turned off after detecting the first use of uninitialized memory. Actually this mode is enabled by default in the Linux kernel:

![kernel configuration menu](http://oi66.tinypic.com/y2eeh.jpg)

We know from the seventh [part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-7.html) of the chapter which describes initialization of the Linux kernel that the kernel command line is parsed during initialization of the Linux kernel in the `do_initcall_level` and `do_early_param` functions. Actually the `kmemcheck` subsystem is initialized in two stages. The first stage is early. If we look at the [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c) source code file, we will see the `param_kmemcheck` function, which is called during early command line parsing:

```C
static int __init param_kmemcheck(char *str)
{
    int val;
    int ret;

    if (!str)
        return -EINVAL;

    ret = kstrtoint(str, 0, &val);
    if (ret)
        return ret;
    kmemcheck_enabled = val;
    return 0;
}

early_param("kmemcheck", param_kmemcheck);
```

As we already saw, the `kmemcheck` command line option may have one of the following values: `0` (disabled), `1` (enabled) or `2` (one-shot). The implementation of `param_kmemcheck` is pretty simple: we just convert the string value of the `kmemcheck` command line option to its integer representation and store it in the `kmemcheck_enabled` variable.

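The same parsing step can be tried outside the kernel. The sketch below is a userspace analogue only: `toy_param_kmemcheck`, `toy_kmemcheck_mode` and the `TOY_*` constants are hypothetical names, `strtol` stands in for `kstrtoint`, and rejecting values outside `0..2` is an extra assumption of this sketch that the kernel function shown above does not make.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical names for the three documented values of kmemcheck=. */
enum toy_kmemcheck_mode {
    TOY_DISABLED = 0,
    TOY_ENABLED  = 1,
    TOY_ONE_SHOT = 2
};

static int toy_kmemcheck_mode = TOY_DISABLED;

/* Parse the option value; returns 0 on success or -EINVAL on bad input. */
static int toy_param_kmemcheck(const char *str)
{
    char *end;
    long val;

    if (!str)
        return -EINVAL;

    errno = 0;
    val = strtol(str, &end, 0);     /* base 0: accepts decimal, hex, octal */
    if (errno || end == str || *end != '\0')
        return -EINVAL;

    /* Assumption of this sketch: only the three documented values pass. */
    if (val < TOY_DISABLED || val > TOY_ONE_SHOT)
        return -EINVAL;

    assert(val >= 0 && val <= 2);
    toy_kmemcheck_mode = (int)val;
    return 0;
}
```

With this sketch, `toy_param_kmemcheck("2")` selects one-shot mode, while `"abc"`, `"7"` or a `NULL` pointer leave the current mode untouched and return `-EINVAL`.
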
The second stage is executed during initialization of the Linux kernel, more precisely during initialization of early [initcalls](https://0xax.gitbooks.io/linux-insides/content/Concepts/initcall.html). The second stage is represented by the `kmemcheck_init` function:

```C
int __init kmemcheck_init(void)
{
    ...
    ...
    ...
}

early_initcall(kmemcheck_init);
```

The main goal of the `kmemcheck_init` function is to call the `kmemcheck_selftest` function and check its result:

```C
if (!kmemcheck_selftest()) {
    printk(KERN_INFO "kmemcheck: self-tests failed; disabling\n");
    kmemcheck_enabled = 0;
    return -EINVAL;
}

printk(KERN_INFO "kmemcheck: Initialized\n");
```

and return `-EINVAL` if this check fails. The `kmemcheck_selftest` function checks the sizes of different memory access related [opcodes](https://en.wikipedia.org/wiki/Opcode) like `rep movsb`, `movzwq` and so on. If the sizes of the opcodes are equal to the expected sizes, `kmemcheck_selftest` returns `true`, and `false` otherwise.

So when somebody calls:

```C
struct my_struct *my_struct = kmalloc(sizeof(struct my_struct), GFP_KERNEL);
```

the `kmem_getpages` function will be called through a series of different function calls. This function is defined in the [mm/slab.c](https://github.com/torvalds/linux/blob/master/mm/slab.c) source code file and its main goal is to allocate [pages](https://en.wikipedia.org/wiki/Paging) with the given flags. At the end of this function we can see the following code:

```C
if (kmemcheck_enabled && !(cachep->flags & SLAB_NOTRACK)) {
    kmemcheck_alloc_shadow(page, cachep->gfporder, flags, nodeid);

    if (cachep->ctor)
        kmemcheck_mark_uninitialized_pages(page, nr_pages);
    else
        kmemcheck_mark_unallocated_pages(page, nr_pages);
}
```

So, here we check that `kmemcheck` is enabled and that the `SLAB_NOTRACK` bit is not set in the flags, and if so we set the `non-present` bit for the just allocated pages. The `SLAB_NOTRACK` bit tells us not to track uninitialized memory. Additionally, if the cache object has a constructor (details will be considered in the next parts), we mark the allocated pages as uninitialized, and as unallocated otherwise. The `kmemcheck_alloc_shadow` function is defined in the [mm/kmemcheck.c](https://github.com/torvalds/linux/blob/master/mm/kmemcheck.c) source code file and does the following things:

```C
void kmemcheck_alloc_shadow(struct page *page, int order, gfp_t flags, int node)
{
    struct page *shadow;
    int pages = 1 << order;    /* number of pages in an allocation of this order */
    int i;

    /* allocate the same number of pages for the shadow memory */
    shadow = alloc_pages_node(node, flags | __GFP_NOTRACK, order);

    /* attach a shadow page to every allocated page */
    for(i = 0; i < pages; ++i)
        page[i].shadow = page_address(&shadow[i]);

    kmemcheck_hide_pages(page, pages);
}
```

First of all it allocates memory for the shadow bits. If the shadow pointer is set for a page, this means that this page is tracked by `kmemcheck`. After the shadow memory is allocated, we point each allocated page at its shadow page. At the end we just call the `kmemcheck_hide_pages` function with the pointer to the allocated pages and the number of these pages. `kmemcheck_hide_pages` is an architecture-specific function, so its implementation is located in the [arch/x86/mm/kmemcheck/kmemcheck.c](https://github.com/torvalds/linux/tree/master/arch/x86/mm/kmemcheck/kmemcheck.c) source code file. The main goal of this function is to mark the given pages as `non-present`. Let's look at the implementation of this function:

```C
void kmemcheck_hide_pages(struct page *p, unsigned int n)
{
    unsigned int i;

    for (i = 0; i < n; ++i) {
        unsigned long address;
        pte_t *pte;
        unsigned int level;

        address = (unsigned long) page_address(&p[i]);
        pte = lookup_address(address, &level);
        BUG_ON(!pte);
        BUG_ON(level != PG_LEVEL_4K);

        set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
        set_pte(pte, __pte(pte_val(*pte) | _PAGE_HIDDEN));
        __flush_tlb_one(address);
    }
}
```

Here we go through all the pages and try to get the `page table entry` for each page. If this operation was successful, we unset the present bit and set the hidden bit of each page. At the end we flush the [translation lookaside buffer](https://en.wikipedia.org/wiki/Translation_lookaside_buffer) entry for the address, because the page table entry was changed. From this point the allocated pages are tracked by `kmemcheck`. Now, as the `present` bit is unset, a [page fault](https://en.wikipedia.org/wiki/Page_fault) exception will occur right after `kmalloc` returns a pointer to the allocated space and some code tries to access this memory.

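The two bit operations performed on each page table entry can be tried on a plain 64-bit integer. This is a sketch under stated assumptions: the `TOY_*` macros and `toy_*` functions are hypothetical, and the bit positions are chosen for illustration, whereas the real `_PAGE_PRESENT` and `_PAGE_HIDDEN` values come from the kernel's own headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only (assumption of this sketch). */
#define TOY_PAGE_PRESENT (UINT64_C(1) << 0)
#define TOY_PAGE_HIDDEN  (UINT64_C(1) << 11)

/* Hide: clear "present" so the next access faults, and mark as tracked. */
static uint64_t toy_pte_hide(uint64_t pte)
{
    pte &= ~TOY_PAGE_PRESENT;
    pte |= TOY_PAGE_HIDDEN;
    assert((pte & TOY_PAGE_PRESENT) == 0);
    return pte;
}

/* Show: set "present" again; the "hidden" marker stays in place. */
static uint64_t toy_pte_show(uint64_t pte)
{
    return pte | TOY_PAGE_PRESENT;
}

/* A page is tracked if it carries the "hidden" marker. */
static bool toy_pte_tracked(uint64_t pte)
{
    return (pte & TOY_PAGE_HIDDEN) != 0;
}
```

Keeping a separate marker bit is what lets a fault handler tell a page that was hidden on purpose apart from a page that is simply not mapped.
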
As you may remember from the [second part](https://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-2.html) of the Linux kernel initialization chapter, the `page fault` handler is located in the [arch/x86/mm/fault.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/fault.c) source code file and represented by the `do_page_fault` function. We can see the following check near the beginning of the `__do_page_fault` function, which is called by it:

```C
static noinline void
__do_page_fault(struct pt_regs *regs, unsigned long error_code,
                unsigned long address)
{
    ...
    ...
    ...
    if (kmemcheck_active(regs))
        kmemcheck_hide(regs);
    ...
    ...
    ...
}
```

The `kmemcheck_active` function gets the `kmemcheck_context` [per-cpu](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html) structure and returns the result of comparing the `balance` field of this structure with zero:

```C
bool kmemcheck_active(struct pt_regs *regs)
{
    struct kmemcheck_context *data = this_cpu_ptr(&kmemcheck_context);

    return data->balance > 0;
}
```

The `kmemcheck_context` structure describes the current state of the `kmemcheck` mechanism. It stores the uninitialized addresses, the number of such addresses and so on. The `balance` field of this structure represents the current state of `kmemcheck`; in other words, it tells us whether `kmemcheck` has already hidden the pages or not yet. If `data->balance` is greater than zero, the `kmemcheck_hide` function will be called. This means that `kmemcheck` has already set the `present` bit for the given pages and now we need to hide the pages again so that the next access causes a page fault. This function hides the addresses of the pages again by unsetting the `present` bit. This means that one session of `kmemcheck` has already finished and a new page fault has occurred. At the first step `kmemcheck_active` returns false, as `data->balance` is zero at the start, and `kmemcheck_hide` is not called. Next, we may see the following line of code in `do_page_fault`:

```C
if (kmemcheck_fault(regs, address, error_code))
    return;
```

First of all the `kmemcheck_fault` function checks that the fault occurred for the right reason. First we check the [flags register](https://en.wikipedia.org/wiki/FLAGS_register) and the code segment to make sure that we are in normal kernel mode:

```C
if (regs->flags & X86_VM_MASK)
    return false;
if (regs->cs != __KERNEL_CS)
    return false;
```

If these checks fail, we return from the `kmemcheck_fault` function, as it was not a `kmemcheck` related page fault. After this we try to look up the `page table entry` related to the faulting address, and if we can't find it we return:

```C
pte = kmemcheck_pte_lookup(address);
if (!pte)
    return false;
```

The last two steps of the `kmemcheck_fault` function are to call the `kmemcheck_access` function, which checks access to the given page, and to show the addresses again by setting the present bit in the given page. The `kmemcheck_access` function does all the main work. It checks the current instruction which caused the page fault. If it finds an error, the context of this error is saved by `kmemcheck` to the ring queue:

```C
static struct kmemcheck_error error_fifo[CONFIG_KMEMCHECK_QUEUE_SIZE];
```

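A fixed-size ring queue like the `error_fifo` array above can be sketched as follows. This is only an illustration under assumptions: the record layout, the `toy_*` names and the policy of dropping new records when the queue is full are choices made for this sketch, and `TOY_QUEUE_SIZE` merely stands in for `CONFIG_KMEMCHECK_QUEUE_SIZE`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_QUEUE_SIZE 4    /* stands in for CONFIG_KMEMCHECK_QUEUE_SIZE */

/* Hypothetical error record; the kernel's structure holds more context. */
struct toy_error {
    unsigned long address;  /* address of the offending access */
    unsigned int size;      /* size of the access in bytes */
};

struct toy_error_queue {
    struct toy_error fifo[TOY_QUEUE_SIZE];
    unsigned int head;      /* next slot to write */
    unsigned int count;     /* number of stored records */
    unsigned int missed;    /* records dropped because the queue was full */
};

/* Producer side (fault path): store a record or count it as missed. */
static bool toy_error_save(struct toy_error_queue *q,
                           unsigned long address, unsigned int size)
{
    if (q->count == TOY_QUEUE_SIZE) {
        q->missed++;
        return false;
    }

    q->fifo[q->head].address = address;
    q->fifo[q->head].size = size;
    q->head = (q->head + 1) % TOY_QUEUE_SIZE;
    q->count++;
    return true;
}

/* Consumer side (deferred reporting): take the oldest record, if any. */
static bool toy_error_recall(struct toy_error_queue *q, struct toy_error *out)
{
    unsigned int tail;

    assert(out != NULL);
    if (q->count == 0)
        return false;

    tail = (q->head + TOY_QUEUE_SIZE - q->count) % TOY_QUEUE_SIZE;
    *out = q->fifo[tail];
    q->count--;
    return true;
}
```

Splitting the work this way keeps the fault path cheap: it only copies a small record into a preallocated slot, and the expensive printing can happen later from a deferred context.
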
The `kmemcheck` mechanism declares a special [tasklet](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html):

```C
static DECLARE_TASKLET(kmemcheck_tasklet, &do_wakeup, 0);
```

which runs the `do_wakeup` function from the [arch/x86/mm/kmemcheck/error.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/kmemcheck/error.c) source code file when it is scheduled to run.

The `do_wakeup` function calls the `kmemcheck_error_recall` function, which prints the errors collected by `kmemcheck`. As we already saw, the:

```C
kmemcheck_show(regs);
```

function is called at the end of the `kmemcheck_fault` function. This function sets the present bit for the given pages again:

```C
if (unlikely(data->balance != 0)) {
    kmemcheck_show_all();
    kmemcheck_error_save_bug(regs);
    data->balance = 0;
    return;
}
```

Here the `kmemcheck_show_all` function calls `kmemcheck_show_addr` for each address:

```C
static unsigned int kmemcheck_show_all(void)
{
    struct kmemcheck_context *data = this_cpu_ptr(&kmemcheck_context);
    unsigned int i;
    unsigned int n;

    n = 0;
    for (i = 0; i < data->n_addrs; ++i)
        n += kmemcheck_show_addr(data->addr[i]);

    return n;
}
```

and `kmemcheck_show_addr` sets the present bit in the page table entry of the given address and flushes its TLB entry:

```C
int kmemcheck_show_addr(unsigned long address)
{
    pte_t *pte;

    pte = kmemcheck_pte_lookup(address);
    if (!pte)
        return 0;

    set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
    __flush_tlb_one(address);
    return 1;
}
```

At the end of the `kmemcheck_show` function we prepare to set the [TF](https://en.wikipedia.org/wiki/Trap_flag) flag. If it wasn't already set, the original flags are saved first so that they can be restored later:

```C
if (!(regs->flags & X86_EFLAGS_TF))
    data->flags = regs->flags;
```

We need the `TF` flag because we have to hide the pages again after the first instruction following the page fault is executed. When the `TF` flag is set, the processor switches into single-step mode and a `debug` exception occurs after the first instruction is executed. At that moment the pages are hidden again and execution continues. As the pages are hidden from this moment, a page fault exception will occur again on the next access, and `kmemcheck` continues to check/collect errors and print them from time to time.

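The flag handling around single-stepping can be illustrated on a plain integer copy of the flags register. This is a sketch under assumptions: `struct toy_context` and the `toy_*` functions are hypothetical, clearing the interrupt flag during the step and the restore logic are choices of this sketch, and only the bit positions of `TF` (bit 8) and `IF` (bit 9) are taken from the x86 architecture.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_EFLAGS_TF (UINT64_C(1) << 8)    /* trap flag */
#define TOY_EFLAGS_IF (UINT64_C(1) << 9)    /* interrupt enable flag */

/* Hypothetical per-CPU state: the flags as they were before the fault. */
struct toy_context {
    uint64_t saved_flags;
};

/* "Show" side: remember the original flags, then request a single step. */
static uint64_t toy_arm_single_step(struct toy_context *ctx, uint64_t flags)
{
    /* Do not overwrite the saved copy if a step is already armed. */
    if (!(flags & TOY_EFLAGS_TF))
        ctx->saved_flags = flags;

    flags |= TOY_EFLAGS_TF;
    /* Assumption: keep interrupts off while the instruction is stepped. */
    flags &= ~TOY_EFLAGS_IF;

    assert(flags & TOY_EFLAGS_TF);
    return flags;
}

/* "Hide" side (debug exception): restore TF and IF from the saved copy. */
static uint64_t toy_disarm_single_step(const struct toy_context *ctx,
                                       uint64_t flags)
{
    if (!(ctx->saved_flags & TOY_EFLAGS_TF))
        flags &= ~TOY_EFLAGS_TF;
    if (ctx->saved_flags & TOY_EFLAGS_IF)
        flags |= TOY_EFLAGS_IF;
    return flags;
}
```

Saving the flags only when `TF` is not yet set means that a second fault taken while a step is already armed cannot clobber the original value.
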
That's all.

Conclusion
--------------------------------------------------------------------------------

This is the end of the third part about linux kernel [memory management](https://en.wikipedia.org/wiki/Memory_management). If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-insides/issues/new). In the next part we will see yet another memory debugging related tool - `kmemleak`.

**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to [linux-insides](https://github.com/0xAX/linux-insides).**

Links
--------------------------------------------------------------------------------

* [memory management](https://en.wikipedia.org/wiki/Memory_management)
* [debugging](https://en.wikipedia.org/wiki/Debugging)
* [memory leaks](https://en.wikipedia.org/wiki/Memory_leak)
* [kmemcheck documentation](https://www.kernel.org/doc/Documentation/kmemcheck.txt)
* [valgrind](https://en.wikipedia.org/wiki/Valgrind)
* [paging](https://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)
* [page fault](https://en.wikipedia.org/wiki/Page_fault)
* [initcalls](https://0xax.gitbooks.io/linux-insides/content/Concepts/initcall.html)
* [opcode](https://en.wikipedia.org/wiki/Opcode)
* [translation lookaside buffer](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)
* [per-cpu variables](https://0xax.gitbooks.io/linux-insides/content/Concepts/per-cpu.html)
* [flags register](https://en.wikipedia.org/wiki/FLAGS_register)
* [tasklet](https://0xax.gitbooks.io/linux-insides/content/interrupts/interrupts-9.html)
* [Paging](http://0xax.gitbooks.io/linux-insides/content/Theory/Paging.html)
* [Previous part](https://0xax.gitbooks.io/linux-insides/content/mm/linux-mm-2.html)