1
0
mirror of https://github.com/0xAX/linux-insides.git synced 2025-01-05 05:10:55 +00:00

Improve sentence structure and fix some typos

This commit is contained in:
Donny Nadolny 2015-06-07 19:50:27 -04:00
parent 1c8bb3420b
commit fe624fe797
3 changed files with 47 additions and 43 deletions

View File

@ -61,3 +61,4 @@ Thank you to all contributors:
* [TheCodeArtist](https://github.com/TheCodeArtist) * [TheCodeArtist](https://github.com/TheCodeArtist)
* [Ehsun N](https://github.com/imehsunn) * [Ehsun N](https://github.com/imehsunn)
* [Adam Shannon](https://github.com/adamdecaf) * [Adam Shannon](https://github.com/adamdecaf)
* [Donny Nadolny](https://github.com/dnadolny)

View File

@ -4,13 +4,13 @@ Linux kernel memory management Part 1.
Introduction Introduction
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
Memory management is a one of the most complex (and I think that it is the most complex) parts of the operating system kernel. In the [last preparations before the kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) part we stopped right before call of the `start_kernel` function. This function initializes all the kernel features (including architecture-dependent features) before the kernel runs the first `init` process. You may remember as we built early page tables, identity page tables and fixmap page tables in the boot time. No compilcated memory management is working yet. When the `start_kernel` function is called we will see the transition to more complex data structures and techniques for memory management. For a good understanding of the initialization process in the linux kernel we need to have clear understanding of the techniques. This chapter will provide an overview of the different parts of the linux kernel memory management framework and its API, starting from the `memblock`. Memory management is one of the most complex (and I think that it is the most complex) parts of the operating system kernel. In the [last preparations before the kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) part we stopped right before call of the `start_kernel` function. This function initializes all the kernel features (including architecture-dependent features) before the kernel runs the first `init` process. You may remember as we built early page tables, identity page tables and fixmap page tables in the boot time. No compilcated memory management is working yet. When the `start_kernel` function is called we will see the transition to more complex data structures and techniques for memory management. For a good understanding of the initialization process in the linux kernel we need to have a clear understanding of these techniques. This chapter will provide an overview of the different parts of the linux kernel memory management framework and its API, starting from the `memblock`.
Memblock Memblock
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
Memblock is one of methods of managing memory regions during the early bootstrap period while the usual kernel memory allocators are not up and Memblock is one of the methods of managing memory regions during the early bootstrap period while the usual kernel memory allocators are not up and
running yet. Previously it was called - `Logical Memory Block`, but from the [patch](https://lkml.org/lkml/2010/7/13/68) by Yinghai Lu, it was renamed to the `memblock`. As Linux kernel for `x86_64` architecture uses this method. We already met `memblock` in the [Last preparations before the kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) part. And now time to get acquainted with it closer. We will see how it is implemented. running yet. Previously it was called `Logical Memory Block`, but with the [patch](https://lkml.org/lkml/2010/7/13/68) by Yinghai Lu, it was renamed to the `memblock`. As Linux kernel for `x86_64` architecture uses this method. We already met `memblock` in the [Last preparations before the kernel entry point](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html) part. And now time to get acquainted with it closer. We will see how it is implemented.
We will start to learn `memblock` from the data structures. Definitions of the all data structures can be found in the [include/linux/memblock.h](https://github.com/torvalds/linux/blob/master/include/linux/memblock.h) header file. We will start to learn `memblock` from the data structures. Definitions of the all data structures can be found in the [include/linux/memblock.h](https://github.com/torvalds/linux/blob/master/include/linux/memblock.h) header file.
@ -28,7 +28,7 @@ struct memblock {
}; };
``` ```
This structure contains five fields. First is `bottom_up` which allows to allocate memory in bottom-up mode when it is `true`. Next field is `current_limit`. This field describes the limit size of the memory block. The next three fields describes the type of the memory block. It can be: reserved, memory and physical memory if `CONFIG_HAVE_MEMBLOCK_PHYS_MAP` configuration option is enabled. Now we met yet another data structure - `memblock_type`. Let's look on its definition: This structure contains five fields. First is `bottom_up` which allows allocating memory in bottom-up mode when it is `true`. Next field is `current_limit`. This field describes the limit size of the memory block. The next three fields describe the type of the memory block. It can be: reserved, memory and physical memory if the `CONFIG_HAVE_MEMBLOCK_PHYS_MAP` configuration option is enabled. Now we see yet another data structure - `memblock_type`. Let's look at its definition:
```C ```C
struct memblock_type { struct memblock_type {
@ -39,7 +39,7 @@ struct memblock_type {
}; };
``` ```
This structure provides information about memory type. It contains fields which describe number of memory regions which are inside current memory block, size of the all memory regions, size of the allocated array of the memory regions and pointer to the array of the `memblock_region` structures. `memblock_region` is a structure which describes memory region. Its definition looks: This structure provides information about memory type. It contains fields which describe the number of memory regions which are inside the current memory block, the size of all memory regions, the size of the allocated array of the memory regions and pointer to the array of the `memblock_region` structures. `memblock_region` is a structure which describes a memory region. Its definition is:
```C ```C
struct memblock_region { struct memblock_region {
@ -60,7 +60,7 @@ struct memblock_region {
#define MEMBLOCK_HOTPLUG 0x1 #define MEMBLOCK_HOTPLUG 0x1
``` ```
Also `memblock_region` provides integer field - [numa](http://en.wikipedia.org/wiki/Non-uniform_memory_access) node selector, if `CONFIG_HAVE_MEMBLOCK_NODE_MAP` configuration option is enabled. Also `memblock_region` provides integer field - [numa](http://en.wikipedia.org/wiki/Non-uniform_memory_access) node selector, if the `CONFIG_HAVE_MEMBLOCK_NODE_MAP` configuration option is enabled.
Schematically we can imagine it as: Schematically we can imagine it as:
@ -85,7 +85,7 @@ These three structures: `memblock`, `memblock_type` and `memblock_region` are ma
Memblock initialization Memblock initialization
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
As all API of the `memblock` described in the [include/linux/memblock.h](https://github.com/torvalds/linux/blob/master/include/linux/memblock.h) header file, all implementation of these function is in the [mm/memblock.c](https://github.com/torvalds/linux/blob/master/mm/memblock.c) source code file. Let's look on the top of source code file and we will look there initialization of the `memblock` structure: As all API of the `memblock` described in the [include/linux/memblock.h](https://github.com/torvalds/linux/blob/master/include/linux/memblock.h) header file, all implementation of these function is in the [mm/memblock.c](https://github.com/torvalds/linux/blob/master/mm/memblock.c) source code file. Let's look at the top of the source code file and we will see the initialization of the `memblock` structure:
```C ```C
struct memblock memblock __initdata_memblock = { struct memblock memblock __initdata_memblock = {
@ -121,7 +121,7 @@ Here we can see initialization of the `memblock` structure which has the same na
You can note that it depends on `CONFIG_ARCH_DISCARD_MEMBLOCK`. If this configuration option is enabled, memblock code will be put to the `.init` section and it will be released after the kernel is booted up. You can note that it depends on `CONFIG_ARCH_DISCARD_MEMBLOCK`. If this configuration option is enabled, memblock code will be put to the `.init` section and it will be released after the kernel is booted up.
Next we can see initialization of the `memblock_type memory`, `memblock_type reserved` and `memblock_type physmem` fields of the `memblock` structure. Here we interesting only in the `memblock_type.regions` initialization process. Note that every `memblock_type` field initialized by the arrays of the `memblock_region`: Next we can see initialization of the `memblock_type memory`, `memblock_type reserved` and `memblock_type physmem` fields of the `memblock` structure. Here we are interested only in the `memblock_type.regions` initialization process. Note that every `memblock_type` field initialized by the arrays of the `memblock_region`:
```C ```C
static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock; static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
@ -152,15 +152,15 @@ On this step initialization of the `memblock` structure finished and we can look
Memblock API Memblock API
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
Ok we have finished with initilization of the `memblock` structure and now we can look on the Memblock API and its implementation. As i said about, all implementation of the `memblock` presented in the [mm/memblock.c](https://github.com/torvalds/linux/blob/master/mm/memblock.c). To understand how `memblock` works and implemented, let's look on it's usage first of all. There are a couple of [places](http://lxr.free-electrons.com/ident?i=memblock) in the linux kernel where memblock is used. For example let's take `memblock_x86_fill` function from the [arch/x86/kernel/e820.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/e820.c#L1061). This function goes through the memory map provided by the [e820](http://en.wikipedia.org/wiki/E820) and adds memory regions reserved by the kernel to the `memblock` with the `memblock_add` function. As we met `memblock_add` function first, let's start from it. Ok we have finished with initilization of the `memblock` structure and now we can look on the Memblock API and its implementation. As I said above, all implementation of the `memblock` presented in the [mm/memblock.c](https://github.com/torvalds/linux/blob/master/mm/memblock.c). To understand how `memblock` works and is implemented, let's look at its usage first of all. There are a couple of [places](http://lxr.free-electrons.com/ident?i=memblock) in the linux kernel where memblock is used. For example let's take `memblock_x86_fill` function from the [arch/x86/kernel/e820.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/e820.c#L1061). This function goes through the memory map provided by the [e820](http://en.wikipedia.org/wiki/E820) and adds memory regions reserved by the kernel to the `memblock` with the `memblock_add` function. As we met `memblock_add` function first, let's start from it.
This function takes physical base address and size of the memory region and adds it to the `memblock`. `memblock_add` function does not anything special in its body, but just calls: This function takes physical base address and size of the memory region and adds it to the `memblock`. `memblock_add` function does not do anything special in its body, but just calls:
```C ```C
memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0); memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0);
``` ```
function. We pass memory block type - `memory`, physical base address and size of the memory region, maximum number of nodes which are zero if `CONFIG_NODES_SHIFT` is not set in the configuration file or `CONFIG_NODES_SHIFT` if it is set, and flags. `memblock_add_range` function adds new memory region to the memory block. It starts from check the size of the given region and if it is zero just return. After this, `memblock_add_range` check existence of the memory regions in the `memblock` structure with the given `memblock_type`. If there are no memory regions, we just fill new `memory_region` with the given values and return (we already saw implementation of this in the [First touch of the linux kernel memory manager framework](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html)). If `memblock_type` is no empty, we start to add new memory region to the `memblock` with the given `memblock_type`. function. We pass memory block type - `memory`, physical base address and size of the memory region, maximum number of nodes which are zero if `CONFIG_NODES_SHIFT` is not set in the configuration file or `CONFIG_NODES_SHIFT` if it is set, and flags. The `memblock_add_range` function adds new memory region to the memory block. It starts by checking the size of the given region and if it is zero it just returns. After this, `memblock_add_range` checks for existence of the memory regions in the `memblock` structure with the given `memblock_type`. If there are no memory regions, we just fill new `memory_region` with the given values and return (we already saw the implementation of this in the [First touch of the linux kernel memory manager framework](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-3.html)). If `memblock_type` is not empty, we start to add new memory region to the `memblock` with the given `memblock_type`.
First of all we get the end of the memory region with the: First of all we get the end of the memory region with the:
@ -168,7 +168,7 @@ First of all we get the end of the memory region with the:
phys_addr_t end = base + memblock_cap_size(base, &size); phys_addr_t end = base + memblock_cap_size(base, &size);
``` ```
`memblock_cap_size` adjusts `size` that `base + size` will not overflow. Its implementation pretty easy: `memblock_cap_size` adjusts `size` that `base + size` will not overflow. Its implementation is pretty easy:
```C ```C
static inline phys_addr_t memblock_cap_size(phys_addr_t base, phys_addr_t *size) static inline phys_addr_t memblock_cap_size(phys_addr_t base, phys_addr_t *size)
@ -179,12 +179,12 @@ static inline phys_addr_t memblock_cap_size(phys_addr_t base, phys_addr_t *size)
`memblock_cap_size` returns new size which is the smallest value between the given `size` and base. `memblock_cap_size` returns new size which is the smallest value between the given `size` and base.
After that we got end address of the new memory region, `memblock_add_region` checks overlap and merge condititions with already added memory regions. Insertion of the new memory region to the `memblcok` consists from two steps: After that we have the end address of the new memory region, `memblock_add_region` checks overlap and merge condititions with already added memory regions. Insertion of the new memory region to the `memblcok` consists of two steps:
* Adding of non-overlapping parts of the new memory area as separate regions; * Adding of non-overlapping parts of the new memory area as separate regions;
* Merging of all neighbouring regions. * Merging of all neighbouring regions.
We are going throuth the all already stored memory regions and check overlapping: We are going through all the already stored memory regions and checking for overlap with the new region:
```C ```C
for (i = 0; i < type->cnt; i++) { for (i = 0; i < type->cnt; i++) {
@ -202,7 +202,7 @@ We are going throuth the all already stored memory regions and check overlapping
} }
``` ```
if new memory region does not overlap regions which are already stored in the `memblock`, insert this region into the memblock with and this is first step, we check that new region can fit into the memory block and call `memblock_double_array` in other way: If the new memory region does not overlap regions which are already stored in the `memblock`, insert this region into the memblock with and this is first step, we check that new region can fit into the memory block and call `memblock_double_array` in other way:
```C ```C
while (type->cnt + nr_new > type->max) while (type->cnt + nr_new > type->max)
@ -212,7 +212,7 @@ while (type->cnt + nr_new > type->max)
goto repeat; goto repeat;
``` ```
`memblock_double_array` doubles the size of the given regions array. Than we set insert to the `true` and go to the `repeat` label. In the second step, starting from the `repeat` label we go through the same loop and insert current memory region into the memory block with the `memblock_insert_region` function: `memblock_double_array` doubles the size of the given regions array. Than we set insert to `true` and go to the `repeat` label. In the second step, starting from the `repeat` label we go through the same loop and insert the current memory region into the memory block with the `memblock_insert_region` function:
```C ```C
if (base < end) { if (base < end) {
@ -223,7 +223,7 @@ while (type->cnt + nr_new > type->max)
} }
``` ```
As we set `insert` to `true` in the first step, now `memblock_insert_region` will be called. `memblock_insert_region` has almost the same implemetation that we saw when we insert new region to the empty `memblock_type` (see above). This function get the last memory region: As we set `insert` to `true` in the first step, now `memblock_insert_region` will be called. `memblock_insert_region` has almost the same implementation that we saw when we insert new region to the empty `memblock_type` (see above). This function gets the last memory region:
```C ```C
struct memblock_region *rgn = &type->regions[idx]; struct memblock_region *rgn = &type->regions[idx];
@ -235,9 +235,9 @@ and copies memory area with `memmove`:
memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn)); memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn));
``` ```
After this fills `memblock_region` fields of the new memory region base, size and etc... and increase size of the `memblock_type`. In the end of the exution, `memblock_add_range` calls `memblock_merge_regions` which merges neighboring compatible regions in the second step. After this fills `memblock_region` fields of the new memory region base, size and etc... and increase size of the `memblock_type`. In the end of the execution, `memblock_add_range` calls `memblock_merge_regions` which merges neighboring compatible regions in the second step.
In the second case new memory region can overlap already stored regions. For example we already have `region1` in the `memblock`: In the second case th new memory region can overlap already stored regions. For example we already have `region1` in the `memblock`:
``` ```
0 0x1000 0 0x1000
@ -279,7 +279,7 @@ if (base < end) {
} }
``` ```
In this case we insert `overlapping portion` (we insert only higher portion, because lower already in the overlapped memory region), than remaining portion and merge these portions with `memblock_merge_regions`. As i said above `memblock_merge_regions` function merges neighboring compatible regions. It goes through the all memory regions from the given `memblock_type`, takes two neighboring memory regions - `type->regions[i]` and `type->regions[i + 1]` and checks that these regions have the same flags, belong to the same node and that end address of the first regions is not equal to the base address of the second region: In this case we insert `overlapping portion` (we insert only the higher portion, because the lower portion is already in the overlapped memory region), then the remaining portion and merge these portions with `memblock_merge_regions`. As I said above `memblock_merge_regions` function merges neighboring compatible regions. It goes through the all memory regions from the given `memblock_type`, takes two neighboring memory regions - `type->regions[i]` and `type->regions[i + 1]` and checks that these regions have the same flags, belong to the same node and that end address of the first regions is not equal to the base address of the second region:
```C ```C
while (i < type->cnt - 1) { while (i < type->cnt - 1) {
@ -330,7 +330,7 @@ That's all. This is the whole principle of the work of the `memblock_add_range`
There is also `memblock_reserve` function which does the same as `memblock_add`, but only with one difference. It stores `memblock_type.reserved` in the memblock instead of `memblock_type.memory`. There is also `memblock_reserve` function which does the same as `memblock_add`, but only with one difference. It stores `memblock_type.reserved` in the memblock instead of `memblock_type.memory`.
Of course it is not full API. Memblock provides API for not only adding `memory` and `reserved` memory regions, but also: Of course this it is not the full API. Memblock provides an API for not only adding `memory` and `reserved` memory regions, but also:
* memblock_remove - removes memory region from memblock; * memblock_remove - removes memory region from memblock;
* memblock_find_in_range - finds free area in given range; * memblock_find_in_range - finds free area in given range;
@ -342,12 +342,12 @@ and many more....
Getting info about memory regions Getting info about memory regions
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
Memblock also provides API for the getting information about allocated memorey regions in the `memblcok`. It splitted on two parts: Memblock also provides an API for getting information about allocated memory regions in the `memblcok`. It is split in two parts:
* get_allocated_memblock_memory_regions_info - getting info about memory regions; * get_allocated_memblock_memory_regions_info - getting info about memory regions;
* get_allocated_memblock_reserved_regions_info - getting info about reserved regions. * get_allocated_memblock_reserved_regions_info - getting info about reserved regions.
Implementation of these function is easy. Let's look on `get_allocated_memblock_reserved_regions_info` for example: Implementation of these functions is easy. Let's look at `get_allocated_memblock_reserved_regions_info` for example:
```C ```C
phys_addr_t __init_memblock get_allocated_memblock_reserved_regions_info( phys_addr_t __init_memblock get_allocated_memblock_reserved_regions_info(
@ -363,7 +363,7 @@ phys_addr_t __init_memblock get_allocated_memblock_reserved_regions_info(
} }
``` ```
First of all this function checks that `memblock` contains reserved memory regions. If `memblock` does not contain reserved memory regions we just return zero. In other way we write physical address of the reserved memory regions array to the given address and return aligned size of the allicated aray. Note that there is `PAGE_ALIGN` macro used for align. Actually it depends on size of page: First of all this function checks that `memblock` contains reserved memory regions. If `memblock` does not contain reserved memory regions we just return zero. Otherwise we write the physical address of the reserved memory regions array to the given address and return aligned size of the allicated aray. Note that there is `PAGE_ALIGN` macro used for align. Actually it depends on size of page:
```C ```C
#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE) #define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
@ -374,14 +374,14 @@ Implementation of the `get_allocated_memblock_memory_regions_info` function is t
Memblock debugging Memblock debugging
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
There are many calls of the `memblock_dbg` in the memblock implementation. If you will pass `memblock=debug` option to the kernel command line, this function will be called. Actually `memblock_dbg` is just a macro which expands to the `printk`: There are many calls to `memblock_dbg` in the memblock implementation. If you pass the `memblock=debug` option to the kernel command line, this function will be called. Actually `memblock_dbg` is just a macro which expands to `printk`:
```C ```C
#define memblock_dbg(fmt, ...) \ #define memblock_dbg(fmt, ...) \
if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__) if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
``` ```
For example you can see call of this macro in the `memblock_reserve` function: For example you can see a call of this macro in the `memblock_reserve` function:
```C ```C
memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n", memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
@ -390,7 +390,7 @@ memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
flags, (void *)_RET_IP_); flags, (void *)_RET_IP_);
``` ```
And you must see something like this: And you will see something like this:
![Memblock](http://oi57.tinypic.com/1zoj589.jpg) ![Memblock](http://oi57.tinypic.com/1zoj589.jpg)
@ -405,9 +405,9 @@ for getting dump of the `memblock` contents.
Conclusion Conclusion
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the end of the first part about linux kernel memory management. If you have questions or suggestions, ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-internals/issues/new). This is the end of the first part about linux kernel memory management. If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-internals/issues/new).
**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** **Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to [linux-internals](https://github.com/0xAX/linux-internals).**
Links Links
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------

View File

@ -4,7 +4,7 @@ Linux kernel memory management Part 2.
Fix-Mapped Addresses and ioremap Fix-Mapped Addresses and ioremap
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
`Fix-Mapped` addresses is a set of the special compile-time addresses whose corresponding physical address do not have to be linear address minus `__START_KERNEL_map`. Each fix-mapped address maps one page frame and kernel uses they as pointers that never change their addresses. It is the main point of these addresses. As comment says: `to have a constant address at compile time, but to set the physical address only in the boot process`. You can remember that in the earliest [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html), we already set the `level2_fixmap_pgt`: `Fix-Mapped` addresses are a set of special compile-time addresses whose corresponding physical address do not have to be a linear address minus `__START_KERNEL_map`. Each fix-mapped address maps one page frame and the kernel uses them as pointers that never change their address. That is the main point of these addresses. As the comment says: `to have a constant address at compile time, but to set the physical address only in the boot process`. You can remember that in the earliest [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-1.html), we already set the `level2_fixmap_pgt`:
```assembly ```assembly
NEXT_PAGE(level2_fixmap_pgt) NEXT_PAGE(level2_fixmap_pgt)
@ -16,7 +16,7 @@ NEXT_PAGE(level1_fixmap_pgt)
.fill 512,8,0 .fill 512,8,0
``` ```
As you can see `level2_fixmap_pgt` is right after the `level2_kernel_pgt` which is kernel code+data+bss. Every fix-mapped address is presented by a integer index which is defined in the `fixed_addresses` enum from the [arch/x86/include/asm/fixmap.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/fixmap.h). For example it contains entries for `VSYSCALL_PAGE` - if emulation of legacy vsyscall page is enabled, `FIX_APIC_BASE` for local [apic](h As you can see `level2_fixmap_pgt` is right after the `level2_kernel_pgt` which is kernel code+data+bss. Every fix-mapped address is represented by an integer index which is defined in the `fixed_addresses` enum from the [arch/x86/include/asm/fixmap.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/fixmap.h). For example it contains entries for `VSYSCALL_PAGE` - if emulation of legacy vsyscall page is enabled, `FIX_APIC_BASE` for local [apic](h
ttp://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) and etc... In a virtual memory fix-mapped area is placed in the modules area: ttp://en.wikipedia.org/wiki/Advanced_Programmable_Interrupt_Controller) and etc... In a virtual memory fix-mapped area is placed in the modules area:
``` ```
@ -37,7 +37,7 @@ Base virtual address and size of the `fix-mapped` area are presented by the two
#define FIXADDR_START (FIXADDR_TOP - FIXADDR_SIZE) #define FIXADDR_START (FIXADDR_TOP - FIXADDR_SIZE)
``` ```
Here `__end_of_permanent_fixed_addresses` is an element of the `fixed_addresses` enum and as I wrote above: Every fix-mapped address is presented by a integer index which is defined in the `fixed_addresses`. `PAGE_SHIFT` determines size of a page. For example size of the one page we can get with the `1 << PAGE_SHIFT`. In our case we need to get the size of the fix-mapped area, but not only of one page, that's why we are using `__end_of_permanent_fixed_addresses` for getting the size of the fix-mapped area. In my case it's a little more than `536` killobytes. In your case it can be different number, because the size depends on amount of the fix-mapped addresses which are depends on your kernel's configuration. Here `__end_of_permanent_fixed_addresses` is an element of the `fixed_addresses` enum and as I wrote above: Every fix-mapped address is represented by an integer index which is defined in the `fixed_addresses`. `PAGE_SHIFT` determines size of a page. For example size of the one page we can get with the `1 << PAGE_SHIFT`. In our case we need to get the size of the fix-mapped area, but not only of one page, that's why we are using `__end_of_permanent_fixed_addresses` for getting the size of the fix-mapped area. In my case it's a little more than `536` killobytes. In your case it might be a different number, because the size depends on amount of the fix-mapped addresses which are depends on your kernel's configuration.
The second `FIXADDR_START` macro just extracts from the last address of the fix-mapped area its size for getting base virtual address of the fix-mapped area. `FIXADDR_TOP` is rounded up address from the base address of the [vsyscall](https://lwn.net/Articles/446528/) space: The second `FIXADDR_START` macro just extracts from the last address of the fix-mapped area its size for getting base virtual address of the fix-mapped area. `FIXADDR_TOP` is rounded up address from the base address of the [vsyscall](https://lwn.net/Articles/446528/) space:
@ -55,13 +55,13 @@ static __always_inline unsigned long fix_to_virt(const unsigned int idx)
} }
``` ```
first of all it check that given index of `fixed_addresses` enum is not greater or equal than `__end_of_fixed_addresses` with the `BUILD_BUG_ON` macro and than returns the result of the `__fix_to_virt` macro: first of all it checks that the index given for the `fixed_addresses` enum is not greater or equal than `__end_of_fixed_addresses` with the `BUILD_BUG_ON` macro and then returns the result of the `__fix_to_virt` macro:
```C ```C
#define __fix_to_virt(x) (FIXADDR_TOP - ((x) << PAGE_SHIFT)) #define __fix_to_virt(x) (FIXADDR_TOP - ((x) << PAGE_SHIFT))
``` ```
Here we shift left the given `fix-mapped` address index on the `PAGE_SHIFT` which determines size of a page as I wrote above and subtract it from the `FIXADDR_TOP` which is the highest address of the `fix-mapped` area. There is inverse function for getting `fix-mapped` address from a virtual address: Here we shift left the given `fix-mapped` address index on the `PAGE_SHIFT` which determines size of a page as I wrote above and subtract it from the `FIXADDR_TOP` which is the highest address of the `fix-mapped` area. There is an inverse function for getting `fix-mapped` address from a virtual address:
```C ```C
static inline unsigned long virt_to_fix(const unsigned long vaddr) static inline unsigned long virt_to_fix(const unsigned long vaddr)
@ -77,14 +77,14 @@ static inline unsigned long virt_to_fix(const unsigned long vaddr)
#define __virt_to_fix(x) ((FIXADDR_TOP - ((x)&PAGE_MASK)) >> PAGE_SHIFT) #define __virt_to_fix(x) ((FIXADDR_TOP - ((x)&PAGE_MASK)) >> PAGE_SHIFT)
``` ```
A PFN is simply in index within physical memory that is counted in page-sized units. PFN for a physical address could be trivially defined as (page_phys_addr >> PAGE_SHIFT); A PFN is simply an index within physical memory that is counted in page-sized units. PFN for a physical address could be trivially defined as (page_phys_addr >> PAGE_SHIFT);
`__virt_to_fix` clears first 12 bits in the given address, subtracts it from the last address the of `fix-mapped` area (`FIXADDR_TOP`) and shifts right result on `PAGE_SHIFT` which is `12`. Let I explain how it works. As i already wrote we will crear first 12 bits in the given address with `x & PAGE_MASK`. As we subtract this from the `FIXADDR_TOP`, we will get last 12 bits of the `FIXADDR_TOP` which are represent. We know that first 12 bits of the virtual address present offset in the page frame. With the shiting it on `PAGE_SHIFT` we will get `Page frame number` which is just all bits in a virtual address besides first 12 offset bits. `Fix-mapped` addresses are used in different [places](http://lxr.free-electrons.com/ident?i=fix_to_virt) of the linux kernel. `IDT` descriptor stored there, [Intel Trusted Execution Technology](http://en.wikipedia.org/wiki/Trusted_Execution_Technology) UUID stored in the `fix-mapped` area started from `FIX_TBOOT_BASE` index, [Xen](http://en.wikipedia.org/wiki/Xen) bootmap and many more... We already saw a little about `fix-mapped` addresses in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) about linux kernel initialization. We used `fix-mapped` area in the early `ioremap` initialization. Let's look on it and try to understand what is it `ioremap`, how it implemented in the kernel and how it releated with the `fix-mapped` addresses. `__virt_to_fix` clears the first 12 bits in the given address, subtracts it from the last address the of `fix-mapped` area (`FIXADDR_TOP`) and shifts right result on `PAGE_SHIFT` which is `12`. Let me explain how it works. As I already wrote we will clear the first 12 bits in the given address with `x & PAGE_MASK`. As we subtract this from the `FIXADDR_TOP`, we will get the last 12 bits of the `FIXADDR_TOP` which are present. We know that the first 12 bits of the virtual address represent the offset in the page frame. With the shiting it on `PAGE_SHIFT` we will get `Page frame number` which is just all bits in a virtual address besides the first 12 offset bits. `Fix-mapped` addresses are used in different [places](http://lxr.free-electrons.com/ident?i=fix_to_virt) in the linux kernel. `IDT` descriptor stored there, [Intel Trusted Execution Technology](http://en.wikipedia.org/wiki/Trusted_Execution_Technology) UUID stored in the `fix-mapped` area started from `FIX_TBOOT_BASE` index, [Xen](http://en.wikipedia.org/wiki/Xen) bootmap and many more... We already saw a little about `fix-mapped` addresses in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) about linux kernel initialization. We used `fix-mapped` area in the early `ioremap` initialization. Let's look on it and try to understand what is it `ioremap`, how it is implemented in the kernel and how it is releated to the `fix-mapped` addresses.
ioremap ioremap
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
Linux kernel provides many different primitives to manage memory. For this moment we will touch `I/O memory`. Every device controlled with reading/writing from/to its registers. For example driver can turn off/on a device by writing to the its registers or get state of a device by reading from its registers. Besides registers, many devices have buffer and where driver can write something or read from there. As we know for this moment there are two ways to access device's registers and data buffers: Linux kernel provides many different primitives to manage memory. For this moment we will touch `I/O memory`. Every device is controlled by reading/writing from/to its registers. For example a driver can turn off/on a device by writing to its registers or get the state of a device by reading from its registers. Besides registers, many devices have buffers where a driver can write something or read from there. As we know for this moment there are two ways to access device's registers and data buffers:
* through the I/O ports; * through the I/O ports;
* mapping of the all registers to the memory address space; * mapping of the all registers to the memory address space;
@ -120,7 +120,7 @@ $ cat /proc/ioports
... ...
``` ```
`/proc/ioporst` provides information about what driver used address of a `I/O` ports region. All of these memory regions, for example `0000-0cf7`, were claimed with the `request_region` function from the [include/linux/ioport.h](https://github.com/torvalds/linux/blob/master/include/linux/ioport.h). Actuall `request_region` is a macro which defied as: `/proc/ioporst` provides information about what driver used address of a `I/O` ports region. All of these memory regions, for example `0000-0cf7`, were claimed with the `request_region` function from the [include/linux/ioport.h](https://github.com/torvalds/linux/blob/master/include/linux/ioport.h). Actually `request_region` is a macro which defied as:
```C ```C
#define request_region(start,n,name) __request_region(&ioport_resource, (start), (n), (name), 0) #define request_region(start,n,name) __request_region(&ioport_resource, (start), (n), (name), 0)
@ -146,6 +146,7 @@ struct resource {
and contains start and end addresses of the resource, name and etc... Every `resource` structure contains pointers to the `parent`, `slibling` and `child` resources. As it has parent and childs, it means that every subset of resuorces has root `resource` structure. For example, for `I/O` ports it is `ioport_resource` structure: and contains start and end addresses of the resource, name and etc... Every `resource` structure contains pointers to the `parent`, `slibling` and `child` resources. As it has parent and childs, it means that every subset of resuorces has root `resource` structure. For example, for `I/O` ports it is `ioport_resource` structure:
```C
struct resource ioport_resource = { struct resource ioport_resource = {
.name = "PCI IO", .name = "PCI IO",
.start = 0, .start = 0,
@ -153,6 +154,7 @@ struct resource ioport_resource = {
.flags = IORESOURCE_IO, .flags = IORESOURCE_IO,
}; };
EXPORT_SYMBOL(ioport_resource); EXPORT_SYMBOL(ioport_resource);
```
Or for `iomem`, it is `iomem_resource` structure: Or for `iomem`, it is `iomem_resource` structure:
@ -163,8 +165,9 @@ struct resource iomem_resource = {
.end = -1, .end = -1,
.flags = IORESOURCE_MEM, .flags = IORESOURCE_MEM,
}; };
```
As I wrote about `request_regions` is used for registering of I/O port region and this macro used in many [places](http://lxr.free-electrons.com/ident?i=request_region) in the kernel. For example let's look on the [drivers/char/rtc.c](https://github.com/torvalds/linux/blob/master/char/rtc.c). This source code file provides [Real Time Clock](http://en.wikipedia.org/wiki/Real-time_clock) interface in the linux kernel. As every kernel module, `rtc` module contains `module_init` definition: As I wrote about `request_regions` is used for registering of I/O port region and this macro used in many [places](http://lxr.free-electrons.com/ident?i=request_region) in the kernel. For example let's look at [drivers/char/rtc.c](https://github.com/torvalds/linux/blob/master/char/rtc.c). This source code file provides [Real Time Clock](http://en.wikipedia.org/wiki/Real-time_clock) interface in the linux kernel. As every kernel module, `rtc` module contains `module_init` definition:
```C ```C
module_init(rtc_init); module_init(rtc_init);
@ -195,7 +198,7 @@ So with the `request_region(RTC_PORT(0), size, "rtc")` we register memory region
0070-0077 : rtc0 0070-0077 : rtc0
``` ```
So, we got it! Ok, it was ports. The second way is use of `I/O` memory. As I wrote above this was is mapping of control registers and memory of a device to the memory address space. `I/O` memory is a set of contiguous addresses which are provides by a device to CPU through a bus. All memory-mapped I/O addresses are not used by the kernel directly. There is special `ioremap` function which allows to covert physical address on a bus to the kernel virtual address or in another words `ioremap` maps I/O physical memory region to access it from the kernel. `ioremap` function takes two parameters: So, we got it! Ok, it was ports. The second way is use of `I/O` memory. As I wrote above this way is mapping of control registers and memory of a device to the memory address space. `I/O` memory is a set of contiguous addresses which are provided by a device to CPU through a bus. All memory-mapped I/O addresses are not used by the kernel directly. There is a special `ioremap` function which allows us to covert the physical address on a bus to the kernel virtual address or in another words `ioremap` maps I/O physical memory region to access it from the kernel. The `ioremap` function takes two parameters:
* start of the memory region; * start of the memory region;
* size of the memory region; * size of the memory region;
@ -254,7 +257,7 @@ static inline const char *e820_type_to_string(int e820_type)
and we can see it in the `/proc/iomem` (read above). and we can see it in the `/proc/iomem` (read above).
Now let's try to understand how `ioremap` works. We already know little about `ioremap`, we saw it in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) about linux kernel initialization. If you have read this part, you can remember call of the `early_ioremap_init` function from the [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/ioremap.c). Initialization of the `ioremap` splitten on two parts: there is early part which we can use before normal `ioremap` is available and normal `ioremap` which is available after `vmalloc` initialization and call of the `paging_init`. We do not know anything about `vmalloc` for now, so let's consider early initialization of the `ioremap`. First of all `early_ioremap_init` checks that `fixmap` is aligned on page middle directory boundary: Now let's try to understand how `ioremap` works. We already know a little about `ioremap`, we saw it in the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Initialization/linux-initialization-5.html) about linux kernel initialization. If you have read this part, you can remember the call of the `early_ioremap_init` function from the [arch/x86/mm/ioremap.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/ioremap.c). Initialization of the `ioremap` is split inn two parts: there is the early part which we can use before the normal `ioremap` is available and the normal `ioremap` which is available after `vmalloc` initialization and call of the `paging_init`. We do not know anything about `vmalloc` for now, so let's consider early initialization of the `ioremap`. First of all `early_ioremap_init` checks that `fixmap` is aligned on page middle directory boundary:
```C ```C
BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1)); BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
@ -500,9 +503,9 @@ So, this is the end!
Conclusion Conclusion
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------
This is the end of the second part about linux kernel memory management. If you have questions or suggestions, ping me in twitter [0xAX](https://twitter.com/0xAX), drop me [email](anotherworldofworld@gmail.com) or just create [issue](https://github.com/0xAX/linux-internals/issues/new). This is the end of the second part about linux kernel memory management. If you have questions or suggestions, ping me on twitter [0xAX](https://twitter.com/0xAX), drop me an [email](anotherworldofworld@gmail.com) or just create an [issue](https://github.com/0xAX/linux-internals/issues/new).
**Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to [linux-internals](https://github.com/0xAX/linux-internals).** **Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me a PR to [linux-internals](https://github.com/0xAX/linux-internals).**
Links Links
-------------------------------------------------------------------------------- --------------------------------------------------------------------------------