mirror of https://github.com/0xAX/linux-insides.git synced 2025-01-05 13:21:00 +00:00

Merge pull request #314 from mudongliang/master

Fix Paging part
0xAX 2016-01-10 19:59:32 +06:00
commit 44f3755097


@@ -4,19 +4,19 @@ Paging
Introduction
--------------------------------------------------------------------------------
- In the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) of the series `Linux kernel booting process` we learned about what the kernel does in its earliest stage. In the next step the kernel will initialize different things like `initrd` mounting, lockdep initialization, and many many others things, before we can see how the kernel runs the first init process.
+ In the fifth [part](http://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-5.html) of the series `Linux kernel booting process` we learned about what the kernel does in its earliest stage. In the next step the kernel will initialize different things like `initrd` mounting, lockdep initialization, and many many other things, before we can see how the kernel runs the first init process.
Yeah, there will be many different things, but many many and once again many work with **memory**.
- In my view, memory management is one of the most complex part of the linux kernel and in system programming in general. This is why before we proceed with the kernel initialization stuff, we need to get acquainted with paging.
+ In my view, memory management is one of the most complex parts of the Linux kernel and in system programming in general. This is why we need to get acquainted with paging, before we proceed with the kernel initialization stuff.
- `Paging` is a mechanism that translates a linear memory address to a physical address. If you have read the previous parts of this book, you may remember that we saw segmentation in real mode when physical addresses are calculated by shifting a segment register by four and adding an offset. We also saw segmentation in protected mode, where we used the descriptor tables and base addresses from descriptors with offsets to calculate the physical addresses. Now that we are in 64-bit mode, will see paging.
+ `Paging` is a mechanism that translates a linear memory address to a physical address. If you have read the previous parts of this book, you may remember that we saw segmentation in real mode when physical addresses are calculated by shifting a segment register by four and adding an offset. We also saw segmentation in protected mode, where we used the descriptor tables and base addresses from descriptors with offsets to calculate the physical addresses. Now we will see paging in 64-bit mode.
As the Intel manual says:
> Paging provides a mechanism for implementing a conventional demand-paged, virtual-memory system where sections of a programs execution environment are mapped into physical memory as needed.
- So... In this post I will try to explain the theory behind paging. Of course it will be closely related to the `x86_64` version of the linux kernel, but we will not go into too much details (at least in this post).
+ So... In this post I will try to explain the theory behind paging. Of course it will be closely related to the `x86_64` version of the Linux kernel, but we will not go into too much details (at least in this post).
Enabling paging
--------------------------------------------------------------------------------
@@ -33,7 +33,7 @@ We will only explain the last mode here. To enable the `IA-32e paging` paging mo
* set the `CR4.PAE` bit;
* set the `IA32_EFER.LME` bit.
- We already saw where those this bits were set in [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):
+ We already saw where those bits were set in [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):
```assembly
movl $(X86_CR0_PG | X86_CR0_PE), %eax
@@ -52,14 +52,14 @@ wrmsr
Paging structures
--------------------------------------------------------------------------------
- Paging divides the linear address space into fixed-size pages. Pages can be mapped into the physical address space or even external storage. This fixed size is `4096` bytes for the `x86_64` linux kernel. To perform the linear address translation to a physical address special structures are used. Every structure is `4096` bytes size and contains `512` entries (this only for `PAE` and `IA32_EFER.LME` modes). Paging structures are hierarchical and the linux kernel uses 4 level of paging in the `x86_64` architecture. The CPU uses a part of the linear address to identify the entry in another paging structure which is at the lower level or physical memory region (`page frame`) or physical address in this region (`page offset`). The address of the top level paging structure located in the `cr3` register. We already saw this in [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):
+ Paging divides the linear address space into fixed-size pages. Pages can be mapped into the physical address space or external storage. This fixed size is `4096` bytes for the `x86_64` Linux kernel. To perform the translation from linear address to physical address, special structures are used. Every structure is `4096` bytes and contains `512` entries (this only for `PAE` and `IA32_EFER.LME` modes). Paging structures are hierarchical and the Linux kernel uses 4 level of paging in the `x86_64` architecture. The CPU uses a part of linear addresses to identify the entry in another paging structure which is at the lower level, physical memory region (`page frame`) or physical address in this region (`page offset`). The address of the top level paging structure located in the `cr3` register. We have already seen this in [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):
```assembly
leal pgtable(%ebx), %eax
movl %eax, %cr3
```
- We built the page table structures and put the address of the top-level structure in the `cr3` register. Here `cr3` is used to store the address of the top-level structure, the `PML4` or `Page Global Directory` as it is called in the linux kernel. `cr3` is 64-bit register and has the following structure:
+ We build the page table structures and put the address of the top-level structure in the `cr3` register. Here `cr3` is used to store the address of the top-level structure, the `PML4` or `Page Global Directory` as it is called in the Linux kernel. `cr3` is 64-bit register and has the following structure:
```
63 52 51 32
@@ -78,24 +78,24 @@ We built the page table structures and put the address of the top-level structur
These fields have the following meanings:
- * Bits 2:0 - ignored;
- * Bits 51:12 - stores the address of the top level paging structure;
- * Bit 3 and 4 - PWT or Page-Level Writethrough and PCD or Page-level cache disable indicate. These bits control the way the page or Page Table is handled by the hardware cache;
- * Reserved - reserved must be 0;
* Bits 63:52 - reserved must be 0.
+ * Bits 51:12 - stores the address of the top level paging structure;
+ * Reserved - reserved must be 0;
+ * Bits 4 : 3 - PWT or Page-Level Writethrough and PCD or Page-level cache disable indicate. These bits control the way the page or Page Table is handled by the hardware cache;
+ * Bits 2 : 0 - ignored;
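As an illustration of the field layout listed above, here is a minimal C sketch that decodes a raw `cr3` value held in an ordinary integer. The macro and function names (`CR3_ADDR_MASK`, `cr3_pml4_base` and so on) are made up for this example and are not kernel identifiers; the positions of `PWT` (bit 3) and `PCD` (bit 4) are taken from the Intel manual.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names only; these are not kernel macros. */
#define CR3_PWT        (1ULL << 3)              /* page-level write-through */
#define CR3_PCD        (1ULL << 4)              /* page-level cache disable */
#define CR3_ADDR_MASK  0x000ffffffffff000ULL    /* bits 51:12 */
#define CR3_RSVD_MASK  0xfff0000000000000ULL    /* bits 63:52, must be 0 */

/* Physical address of the top-level paging structure (PML4). */
static inline uint64_t cr3_pml4_base(uint64_t cr3)
{
    return cr3 & CR3_ADDR_MASK;
}

static inline bool cr3_pwt(uint64_t cr3)
{
    return (cr3 & CR3_PWT) != 0;
}

static inline bool cr3_pcd(uint64_t cr3)
{
    return (cr3 & CR3_PCD) != 0;
}

/* True when the reserved bits 63:52 are all zero. */
static inline bool cr3_reserved_clear(uint64_t cr3)
{
    return (cr3 & CR3_RSVD_MASK) == 0;
}
```

Reading the real `cr3` register needs ring 0, so the sketch only works on a value that has already been obtained somehow; masking with `CR3_ADDR_MASK` also drops the ignored low bits.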
- The linear address translation address is following:
+ The linear address translation is following:
* A given linear address arrives to the [MMU](http://en.wikipedia.org/wiki/Memory_management_unit) instead of memory bus.
- * 64-bit linear address splits on some parts. Only low 48 bits are significant, it means that `2^48` or 256 TBytes of linear-address space may be accessed at any given time.
+ * 64-bit linear address is split into some parts. Only low 48 bits are significant, it means that `2^48` or 256 TBytes of linear-address space may be accessed at any given time.
* `cr3` register stores the address of the 4 top-level paging structure.
- * `47:39` bits of the given linear address stores an index into the paging structure level-4, `38:30` bits stores index into the paging structure level-3, `29:21` bits stores an index into the paging structure level-2, `20:12` bits stores an index into the paging structure level-1 and `11:0` bits provide the byte offset into the physical page.
+ * `47:39` bits of the given linear address store an index into the paging structure level-4, `38:30` bits store index into the paging structure level-3, `29:21` bits store an index into the paging structure level-2, `20:12` bits store an index into the paging structure level-1 and `11:0` bits provide the offset into the physical page in byte.
schematically, we can imagine it like this:
![4-level paging](http://oi58.tinypic.com/207mb0x.jpg)
- Every access to a linear address is either a supervisor-mode access or a user-mode access. This access is determined by the `CPL` (current privilege level). If `CPL < 3` it is a supervisor mode access level otherwise, otherwise it is a user mode access level. For example, the top level page table entry contains access bits and has the following structure:
+ Every access to a linear address is either a supervisor-mode access or a user-mode access. This access is determined by the `CPL` (current privilege level). If `CPL < 3` it is a supervisor mode access level, otherwise it is a user mode access level. For example, the top level page table entry contains access bits and has the following structure:
```
63 62 52 51 32
@@ -117,28 +117,28 @@ Where:
* 63 bit - N/X bit (No Execute Bit) which presents ability to execute the code from physical pages mapped by the table entry;
* 62:52 bits - ignored by CPU, used by system software;
* 51:12 bits - stores physical address of the lower level paging structure;
- * 11:9 bits - ignored by CPU;
+ * 11: 9 bits - ignored by CPU;
* MBZ - must be zero bits;
* Ignored bits;
* A - accessed bit indicates was physical page or page structure accessed;
* PWT and PCD used for cache;
- * U/S - user/supervisor bit controls user access to the all physical pages mapped by this table entry;
- * R/W - read/write bit controls read/write access to the all physical pages mapped by this table entry;
+ * U/S - user/supervisor bit controls user access to all the physical pages mapped by this table entry;
+ * R/W - read/write bit controls read/write access to all the physical pages mapped by this table entry;
* P - present bit. Current bit indicates was page table or physical page loaded into primary memory or not.
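The flag bits from the list above can be tested with plain masks. The sketch below assumes the usual x86_64 bit positions (`P` is bit 0, `R/W` bit 1, `U/S` bit 2, `PWT` bit 3, `PCD` bit 4, `A` bit 5, `N/X` bit 63). The `PTE_*` names and the helper functions are illustrative and are not the kernel's own definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag names for one paging-structure entry. */
#define PTE_P          (1ULL << 0)              /* present */
#define PTE_RW         (1ULL << 1)              /* read/write */
#define PTE_US         (1ULL << 2)              /* user/supervisor */
#define PTE_PWT        (1ULL << 3)              /* write-through */
#define PTE_PCD        (1ULL << 4)              /* cache disable */
#define PTE_A          (1ULL << 5)              /* accessed */
#define PTE_NX         (1ULL << 63)             /* no execute */
#define PTE_ADDR_MASK  0x000ffffffffff000ULL    /* bits 51:12 */

/* Physical address of the lower-level structure or page frame. */
static inline uint64_t pte_next_table(uint64_t entry)
{
    return entry & PTE_ADDR_MASK;
}

/* Does this single entry allow user-mode writes? */
static inline bool pte_user_writable(uint64_t entry)
{
    const uint64_t need = PTE_P | PTE_RW | PTE_US;
    return (entry & need) == need;
}

/* Does this single entry allow instruction fetches?
 * Assumes the N/X bit is honoured (IA32_EFER.NXE set). */
static inline bool pte_executable(uint64_t entry)
{
    return (entry & PTE_P) != 0 && (entry & PTE_NX) == 0;
}
```

Each helper looks at one entry only. The effective permission for an address depends on the entries at every level of the walk, so a real check would apply these tests to each of them.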
- Ok, we know about the paging structures and their entries. Now let's see some details about 4-level paging in the linux kernel.
+ Ok, we know about the paging structures and their entries. Now let's see some details about 4-level paging in the Linux kernel.
- Paging structures in the linux kernel
+ Paging structures in the Linux kernel
--------------------------------------------------------------------------------
- As we've seen, the linux kernel in `x86_64` uses 4-level page tables. Their names are:
+ As we've seen, the Linux kernel in `x86_64` uses 4-level page tables. Their names are:
* Page Global Directory
* Page Upper Directory
* Page Middle Directory
* Page Table Entry
- After you've compiled and installed the linux kernel, you can see the `System.map` file which stores the virtual addresses of the functions that are used by the kernel. For example:
+ After you've compiled and installed the Linux kernel, you can see the `System.map` file which stores the virtual addresses of the functions that are used by the kernel. For example:
```
$ grep "start_kernel" System.map
@@ -146,14 +146,14 @@ ffffffff81efe497 T x86_64_start_kernel
ffffffff81efeaa2 T start_kernel
```
- We can see `0xffffffff81efe497` here. I doubt you really have that much RAM installed. But anyway, `start_kernel` and `x86_64_start_kernel` will be executed. The address space in `x86_64` is `2^64` size, but it's too large, that's why a smaller address space is used, only 48-bits wide. So we have a situation where the physical address space is limited to 48 bits, but addressing still performed with 64 bit pointers. How is this problem solved? Look at this diagram:
+ We can see `0xffffffff81efe497` here. I doubt you really have that much RAM installed. But anyway, `start_kernel` and `x86_64_start_kernel` will be executed. The address space in `x86_64` is `2^64` wide, but it's too large, that's why a smaller address space is used, only 48-bits wide. So we have a situation where the physical address space is limited to 48 bits, but addressing still performs with 64 bit pointers. How is this problem solved? Look at this diagram:
```
0xffffffffffffffff +-----------+
                   |           |
                   |           | Kernelspace
                   |           |
0xffff800000000000 +-----------+
                   |           |
                   |           |
                   |   hole    |
@@ -166,7 +166,7 @@ We can see `0xffffffff81efe497` here. I doubt you really have that much RAM inst
0x0000000000000000+-----------+
```
- This solution is `sign extension`. Here we can see that the lower 48 bits of a virtual address can be used for addressing. Bits `63:48` can be either only zeroes or only ones. Note that the virtual address space is split in 2 parts:
+ This solution is `sign extension`. Here we can see that the lower 48 bits of a virtual address can be used for addressing. Bits `63:48` can be either only zeroes or only ones. Note that the virtual address space is split into 2 parts:
* Kernel space
* Userspace
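Sign extension here means that bits `63:48` must repeat bit `47`. A small sketch of that rule, assuming 48 significant bits as described above; the helper names (`sign_extend_48`, `is_canonical`, `is_kernel_half`) are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

#define LOW48_MASK   0x0000ffffffffffffULL   /* bits 47:0 */
#define HIGH16_MASK  0xffff000000000000ULL   /* bits 63:48 */
#define BIT47        (1ULL << 47)

/* Copy bit 47 into bits 63:48, ignoring whatever was there before. */
static inline uint64_t sign_extend_48(uint64_t addr)
{
    uint64_t low = addr & LOW48_MASK;
    if (low & BIT47)
        low |= HIGH16_MASK;
    return low;
}

/* An address is canonical when it already equals its sign-extended form. */
static inline bool is_canonical(uint64_t addr)
{
    return sign_extend_48(addr) == addr;
}

/* Canonical addresses with bit 47 set fall in the upper (kernel) half. */
static inline bool is_kernel_half(uint64_t addr)
{
    return is_canonical(addr) && (addr & BIT47) != 0;
}
```

With these helpers `0xffffffff81efe497` is canonical and lies in the kernel half, while `0x0000800000000000` is the first address of the non-canonical hole.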
@@ -201,13 +201,13 @@ We can see here the memory map for user space, kernel space and the non-canonica
Previously this guard hole and `__PAGE_OFFSET` was from `0xffff800000000000` to `0xffff80ffffffffff` to prevent access to non-canonical area, but was later extended by 3 bits for the hypervisor.
- Next is the lowest usable address in kernel space - `ffff880000000000`. This virtual memory region is for direct mapping of the all physical memory. After the memory space which maps all physical addresses, the guard hole. It needs to be between the direct mapping of all the physical memory and the vmalloc area. After the virtual memory map for the first terabyte and the unused hole after it, we can see the `kasan` shadow memory. It was added by [commit](https://github.com/torvalds/linux/commit/ef7f0d6a6ca8c9e4b27d78895af86c2fbfaeedb2) and provides the kernel address sanitizer. After the next unused hole we can see the `esp` fixup stacks (we will talk about it in other parts of this book) and the start of the kernel text mapping from the physical address - `0`. We can find the definition of this address in the same file as the `__PAGE_OFFSET`:
+ Next is the lowest usable address in kernel space - `ffff880000000000`. This virtual memory region is for direct mapping of all the physical memory. After the memory space which maps all the physical addresses, the guard hole. It needs to be between the direct mapping of all the physical memory and the vmalloc area. After the virtual memory map for the first terabyte and the unused hole after it, we can see the `kasan` shadow memory. It was added by [commit](https://github.com/torvalds/linux/commit/ef7f0d6a6ca8c9e4b27d78895af86c2fbfaeedb2) and provides the kernel address sanitizer. After the next unused hole we can see the `esp` fixup stacks (we will talk about it in other parts of this book) and the start of the kernel text mapping from the physical address - `0`. We can find the definition of this address in the same file as the `__PAGE_OFFSET`:
```C
#define __START_KERNEL_map _AC(0xffffffff80000000, UL)
```
- Usually kernel's `.text` start here with the `CONFIG_PHYSICAL_START` offset. We saw it in the post about [ELF64](https://github.com/0xAX/linux-insides/blob/master/Theory/ELF.md):
+ Usually kernel's `.text` starts here with the `CONFIG_PHYSICAL_START` offset. We have seen it in the post about [ELF64](https://github.com/0xAX/linux-insides/blob/master/Theory/ELF.md):
```
readelf -s vmlinux | grep ffffffff81000000
@@ -216,11 +216,11 @@ readelf -s vmlinux | grep ffffffff81000000
90766: ffffffff81000000 0 NOTYPE GLOBAL DEFAULT 1 startup_64
```
- Here i checked `vmlinux` with the `CONFIG_PHYSICAL_START` is `0x1000000`. So we have the start point of the kernel `.text` - `0xffffffff80000000` and offset - `0x1000000`, the resulted virtual address will be `0xffffffff80000000 + 1000000 = 0xffffffff81000000`.
+ Here I check `vmlinux` with `CONFIG_PHYSICAL_START` is `0x1000000`. So we have the start point of the kernel `.text` - `0xffffffff80000000` and offset - `0x1000000`, the resulted virtual address will be `0xffffffff80000000 + 1000000 = 0xffffffff81000000`.
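The arithmetic above can be written out in C as a sketch. Both constants are the values quoted in the text, read as hexadecimal, and `kernel_text_vaddr` is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/* Values quoted in the text; the macro names are illustrative. */
#define START_KERNEL_MAP  0xffffffff80000000ULL  /* __START_KERNEL_map */
#define PHYSICAL_START    0x1000000ULL           /* CONFIG_PHYSICAL_START in the example */

/* Virtual address at which the kernel .text starts for a given
 * physical load offset. */
static inline uint64_t kernel_text_vaddr(uint64_t physical_start)
{
    return START_KERNEL_MAP + physical_start;
}
```

Calling `kernel_text_vaddr(PHYSICAL_START)` yields `0xffffffff81000000`, the address reported for `startup_64` in the `readelf` output.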
- After the kernel `.text` region there is the virtual memory region for kernel modules, `vsyscalls` and an unused hole of 2 megabytes.
+ After the kernel `.text` region there is the virtual memory region for kernel module, `vsyscalls` and an unused hole of 2 megabytes.
- We've seen how the kernel's virtual memory map is laid out and how a virtual address is translated into a physical one. Let's take for example following address:
+ We've seen how virtual memory map in the kernel is laid out and how a virtual address is translated into a physical one. Let's take the following address as example:
```
0xffffffff81000000
@@ -236,22 +236,21 @@ In binary it will be:
This virtual address is split in parts as described above:
* `63:48` - bits not used;
- * `47:39` - bits of the given linear address stores an index into the paging structure level-4;
- * `38:30` - bits stores index into the paging structure level-3;
- * `29:21` - bits stores an index into the paging structure level-2;
- * `20:12` - bits stores an index into the paging structure level-1;
- * `11:0` - bits provide the byte offset into the physical page.
+ * `47:39` - bits store an index into the paging structure level-4;
+ * `38:30` - bits store index into the paging structure level-3;
+ * `29:21` - bits store an index into the paging structure level-2;
+ * `20:12` - bits store an index into the paging structure level-1;
+ * `11:0` - bits provide the offset into the physical page in byte.
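Applying this split to `0xffffffff81000000` takes only shifts and masks. A minimal sketch, with hypothetical helper names (`table_index`, `page_offset`) and assuming 9 index bits per level and a 12-bit page offset as described above:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT      12    /* bits 11:0 are the offset into the page */
#define BITS_PER_LEVEL  9     /* 512 entries per paging structure */
#define ENTRIES         512

/* Index into the paging structure at the given level (4 = top level,
 * 1 = lowest level) for a virtual address. */
static inline unsigned table_index(uint64_t vaddr, int level)
{
    assert(level >= 1 && level <= 4);
    unsigned shift = PAGE_SHIFT + BITS_PER_LEVEL * (unsigned)(level - 1);
    return (unsigned)((vaddr >> shift) & (ENTRIES - 1));
}

/* Byte offset into the physical page. */
static inline unsigned page_offset(uint64_t vaddr)
{
    return (unsigned)(vaddr & ((1ULL << PAGE_SHIFT) - 1));
}
```

For `0xffffffff81000000` this gives index 511 at level 4, 510 at level 3, 8 at level 2, 0 at level 1 and a page offset of 0.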
That is all. Now you know a little about theory of `paging` and we can go ahead in the kernel source code and see the first initialization steps.
Conclusion
--------------------------------------------------------------------------------
- It's the end of this short part about paging theory. Of course this post doesn't cover every detail of paging, but soon we'll see in practice how the linux kernel builds paging structures and works with them.
+ It's the end of this short part about paging theory. Of course this post doesn't cover every detail of paging, but soon we'll see in practice how the Linux kernel builds paging structures and works with them.
**Please note that English is not my first language and I am really sorry for any inconvenience. If you've found any mistakes please send me PR to [linux-insides](https://github.com/0xAX/linux-insides).**
Links
--------------------------------------------------------------------------------