You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

29 KiB

Kernel initialization. Part 6.

Architecture-specific initialization, again...

In the previous part we saw architecture-specific (x86_64 in our case) initialization stuff from the arch/x86/kernel/setup.c and finished on x86_configure_nx function which sets the _PAGE_NX flag depending on the support of NX bit. As I wrote before setup_arch function and start_kernel are very big, so in this and in the next part we will continue to learn about architecture-specific initialization process. The next function after x86_configure_nx is parse_early_param. This function is defined in the init/main.c and as you can understand from its name, this function parses kernel command line and setups different services depending on the given parameters (all kernel command line parameters you can find in the Documentation/kernel-parameters.txt). You may remember how we set up earlyprintk in the earliest part. At the early stage we looked for kernel parameters and their value using the cmdline_find_option function and __cmdline_find_option, __cmdline_find_option_bool helpers from the arch/x86/boot/cmdline.c. In these functions we were in the generic kernel part not dependent on architecture whereas here we undertake another approach. If you have been reading Linux kernel source code, you probably already noticed calls like this:

early_param("gbpages", parse_direct_gbpages_on);

early_param macro takes two parameters:

  • command line parameter name;
  • function which will be called if given parameter is passed.

and defined as:

#define early_param(str, fn) \
        __setup_param(str, fn, fn, 1)

in the include/linux/init.h. As you can see early_param macro just calls the __setup_param macro:

#define __setup_param(str, unique_id, fn, early)                \
        static const char __setup_str_##unique_id[] __initconst \
                __aligned(1) = str; \
        static struct obs_kernel_param __setup_##unique_id      \
                __used __section(.init.setup)                   \
                __attribute__((aligned((sizeof(long)))))        \
                = { __setup_str_##unique_id, fn, early }

This macro defines __setup_str_*_id variable (where * depends on the given function name) and assigns it to the given command line parameter name. In the next line we can see the definition of the __setup_* variable of type obs_kernel_param and its initialization. obs_kernel_param structure defined as:

struct obs_kernel_param {
        const char *str;
        int (*setup_func)(char *);
        int early;
};

and contains three fields:

  • name of the kernel parameter;
  • function which setups something depending on the parameter;
  • field determining if the parameter is early (1) or not (0).

Note that __set_param macro is defined with __section(.init.setup) attribute. It means that all __setup_str_* will be placed in the .init.setup section. Moreover, as we can see in the include/asm-generic/vmlinux.lds.h, they will be placed between __setup_start and __setup_end:

#define INIT_SETUP(initsetup_align)                \
                . = ALIGN(initsetup_align);        \
                VMLINUX_SYMBOL(__setup_start) = .; \
                *(.init.setup)                     \
                VMLINUX_SYMBOL(__setup_end) = .;

Now we know how parameters are defined, let's go back to the parse_early_param implementation:

void __init parse_early_param(void)
{
        static int done __initdata;
        static char tmp_cmdline[COMMAND_LINE_SIZE] __initdata;

        if (done)
                return;

        /* All fall through to do_early_param. */
        strlcpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
        parse_early_options(tmp_cmdline);
        done = 1;
}

The parse_early_param function defines two static variables. First done checks that parse_early_param was already called and the second is a temporary storage for the kernel command line. After this we copy boot_command_line to the temporary command line which we just defined and call the parse_early_options function from the same source code main.c file. parse_early_options calls the parse_args function from the kernel/params.c where parse_args parses given command line and calls the do_early_param function. This function goes from the __setup_start to __setup_end, and calls the functions from the obs_kernel_param if a parameter is early. After this all services dependent on early command line parameters have been setup and the next call after the parse_early_param is x86_report_nx. As I wrote in the beginning of this part, we already set NX-bit with the x86_configure_nx. The next x86_report_nx function from the arch/x86/mm/setup_nx.c just prints information about the NX. Note that we do not call x86_report_nx right after the x86_configure_nx, but after the call to the parse_early_param. The answer as to why we do so is simple: we call it after the parse_early_param because the kernel supports noexec parameter:

noexec		[X86]
			On X86-32 available only on PAE configured kernels.
			noexec=on: enable non-executable mappings (default)
			noexec=off: disable non-executable mappings

We can see it during the boot time:

NX

After this we can see the call to the:

	memblock_x86_reserve_range_setup_data();

function. This function is defined in the same arch/x86/kernel/setup.c source code file and remaps memory for the setup_data and reserved memory block for the setup_data (more about setup_data you can read in the previous part and about ioremap and memblock you can read in the Linux kernel memory management).

In the next step we can see the following conditional statement:

	if (acpi_mps_check()) {
#ifdef CONFIG_X86_LOCAL_APIC
		disable_apic = 1;
#endif
		setup_clear_cpu_cap(X86_FEATURE_APIC);
	}

The first acpi_mps_check function from the arch/x86/kernel/acpi/boot.c depends on CONFIG_X86_LOCAL_APIC and CONFIG_x86_MPPARSE configuration options:

int __init acpi_mps_check(void)
{
#if defined(CONFIG_X86_LOCAL_APIC) && !defined(CONFIG_X86_MPPARSE)
        /* mptable code is not built-in*/
        if (acpi_disabled || acpi_noirq) {
                printk(KERN_WARNING "MPS support code is not built-in.\n"
                       "Using acpi=off or acpi=noirq or pci=noacpi "
                       "may have problem\n");
                 return 1;
        }
#endif
        return 0;
}

It checks the built-in MPS or MultiProcessor Specification table. If CONFIG_X86_LOCAL_APIC is set and CONFIG_x86_MPPARSE is not set, acpi_mps_check prints warning message if one of the command line options: acpi=off, acpi=noirq or pci=noacpi was passed to the kernel. If acpi_mps_check returns 1 it means that we disable local APIC and clear X86_FEATURE_APIC bit in the of the current CPU with the setup_clear_cpu_cap macro. (you can read more about CPU masks in the CPU masks).

Early PCI dump

In the next step we dump the PCI devices with the following code:

#ifdef CONFIG_PCI
	if (pci_early_dump_regs)
		early_dump_pci_devices();
#endif

pci_early_dump_regs variable defined in the arch/x86/pci/common.c and its value depend on the kernel command line parameter: pci=earlydump. We can find definition of this parameter in the drivers/pci/pci.c:

early_param("pci", pci_setup);

pci_setup function gets the string after the pci= and analyzes it. This function calls the pcibios_setup which is defined as __weak in the drivers/pci/pci.c and every architecture defines the same function which overrides its __weak analog. For example x86_64 architecture-dependent version is in the arch/x86/pci/common.c:

char *__init pcibios_setup(char *str) {
        ...
		...
		...
		} else if (!strcmp(str, "earlydump")) {
                pci_early_dump_regs = 1;
                return NULL;
        }
		...
		...
		...
}

So, if CONFIG_PCI option is set and we passed pci=earlydump option to the kernel command line, next function which will be called is early_dump_pci_devices from the arch/x86/pci/early.c. This function checks noearly PCI parameter with:

if (!early_pci_allowed())
        return;

and returns if it was passed. Each PCI domain can host up to 256 buses and each bus hosts up to 32 devices. So, we proceed in a loop:

for (bus = 0; bus < 256; bus++) {
                for (slot = 0; slot < 32; slot++) {
                        for (func = 0; func < 8; func++) {
						...
						...
						...
                        }
                }
}

and read the pci config with the read_pci_config function.

That's all. Here we won't go deeper in the pci details, but rather do it in the special Drivers/PCI part.

Finish with memory parsing

After the early_dump_pci_devices, there are a couple of function related to the available memory and e820 which we collected in the First steps in the kernel setup part:

	/* update the e820_saved too */
	e820_reserve_setup_data();
	finish_e820_parsing();
	...
	...
	...
	e820_add_kernel_range();
	trim_bios_range(void);
	max_pfn = e820_end_of_ram_pfn();
	early_reserve_e820_mpc_new();

Let's take a look at it now. As you can see the first function is e820_reserve_setup_data. This function does almost the same thing as memblock_x86_reserve_range_setup_data which we saw above, but it also calls e820_update_range which adds new regions to the e820map with the given type which is E820_RESERVED_KERN in our case. The next function is finish_e820_parsing which sanitizes e820map with the sanitize_e820_map function. Besides these two functions we can see a couple of functions related to the e820. You can see it in the listing above. e820_add_kernel_range function takes the physical addresses of the kernel start and end:

u64 start = __pa_symbol(_text);
u64 size = __pa_symbol(_end) - start;

then checks that .text .data and .bss are marked as E820RAM in the e820map and prints the warning message if not. The next function trim_bios_range updates first 4096 bytes in e820Map as E820_RESERVED and sanitizes it again with the call of the sanitize_e820_map. After this we get the last page frame number with the call to the e820_end_of_ram_pfn function. Every memory page has a unique number - Page frame number and e820_end_of_ram_pfn function returns the maximum with the call of the e820_end_pfn:

unsigned long __init e820_end_of_ram_pfn(void)
{
	return e820_end_pfn(MAX_ARCH_PFN);
}

where e820_end_pfn takes maximum page frame number on the certain architecture (MAX_ARCH_PFN is 0x400000000 for x86_64). In the e820_end_pfn we go through the all e820 slots and check that e820 entry has E820_RAM or E820_PRAM type because we calculate page frame numbers only for these types, then we get the base address and the end address of the page frame number for the current e820 entry and perform some checks for these addresses:

for (i = 0; i < e820.nr_map; i++) {
		struct e820entry *ei = &e820.map[i];
		unsigned long start_pfn;
		unsigned long end_pfn;

		if (ei->type != E820_RAM && ei->type != E820_PRAM)
			continue;

		start_pfn = ei->addr >> PAGE_SHIFT;
		end_pfn = (ei->addr + ei->size) >> PAGE_SHIFT;

    if (start_pfn >= limit_pfn)
			continue;
		if (end_pfn > limit_pfn) {
			last_pfn = limit_pfn;
			break;
		}
		if (end_pfn > last_pfn)
			last_pfn = end_pfn;
}
	if (last_pfn > max_arch_pfn)
		last_pfn = max_arch_pfn;

	printk(KERN_INFO "e820: last_pfn = %#lx max_arch_pfn = %#lx\n",
			 last_pfn, max_arch_pfn);
	return last_pfn;

After this we check that last_pfn (which we obtained inside the loop) is not greater that maximum page frame number for the current architecture (x86_64 in our case), print information about the last page frame number and return it. We can observe the last_pfn in the dmesg output:

...
[    0.000000] e820: last_pfn = 0x41f000 max_arch_pfn = 0x400000000
...

After this, as we have calculated the greatest page frame number, we calculate max_low_pfn which is the greatest page frame number in the low memory or below the first 4 gigabytes. If more than 4 gigabytes of RAM are installed, max_low_pfn will be the result of the e820_end_of_low_ram_pfn function which does the same thing as e820_end_of_ram_pfn but with the 4 gigabytes limit, in other case max_low_pfn will be equal to max_pfn:

if (max_pfn > (1UL<<(32 - PAGE_SHIFT)))
	max_low_pfn = e820_end_of_low_ram_pfn();
else
	max_low_pfn = max_pfn;

high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;

Next we calculate high_memory (defining the upper bound on direct map memory) with __va macro which returns a virtual address from the given physical memory.

DMI scanning

The next step after manipulations with different memory regions and e820 slots is collecting information about the computer. We will get all information with the Desktop Management Interface and following functions:

dmi_scan_machine();
dmi_memdev_walk();

First is dmi_scan_machine defined in the drivers/firmware/dmi_scan.c. This function goes through the System Management BIOS structures and extracts information. There are two ways specified to gain access to the SMBIOS table: get the pointer to the SMBIOS table from the EFI's configuration table and scanning the physical memory between 0xF0000 and 0xFFFFF addresses, resulting in 0x10000 bytes totally. Let's take a look at the second approach. dmi_scan_machine function remaps memory addresses start from 0xF0000 to 0xFFFFF with size of 0x10000 bytes using the dmi_early_remap which simply expands to the early_ioremap:

void __init dmi_scan_machine(void)
{
	char __iomem *p, *q;
	char buf[32];
	...
	...
	...
	p = dmi_early_remap(0xF0000, 0x10000);
	if (p == NULL)
			goto error;

and iterates over all DMI header addresses and matches the _SM_ string:

memset(buf, 0, 16);
for (q = p; q < p + 0x10000; q += 16) {
		memcpy_fromio(buf + 16, q, 16);
		if (!dmi_smbios3_present(buf) || !dmi_present(buf)) {
			dmi_available = 1;
			dmi_early_unmap(p, 0x10000);
			goto out;
		}
		memcpy(buf, buf + 16, 16);
}

_SM_ string must be between 000F0000h and 0x000FFFFF. Here we copy 16 bytes to the buf with memcpy_fromio which is the same as memcpy and execute dmi_smbios3_present and dmi_present on the buffer. These functions check that first 4 bytes match the _SM_ string, get the SMBIOS version and get _DMI_ attributes as DMI structure table length, table address and etc... After one of these functions finishes, you will see the result of it in the dmesg output:

[    0.000000] SMBIOS 2.7 present.
[    0.000000] DMI: Gigabyte Technology Co., Ltd. Z97X-UD5H-BK/Z97X-UD5H-BK, BIOS F6 06/17/2014

In the end of the dmi_scan_machine, we unmap the previously remapped memory:

dmi_early_unmap(p, 0x10000);

The second function is - dmi_memdev_walk. As you can understand it goes over memory devices. Let's look on it:

void __init dmi_memdev_walk(void)
{
	if (!dmi_available)
		return;

	if (dmi_walk_early(count_mem_devices) == 0 && dmi_memdev_nr) {
		dmi_memdev = dmi_alloc(sizeof(*dmi_memdev) * dmi_memdev_nr);
		if (dmi_memdev)
			dmi_walk_early(save_mem_devices);
	}
}

It checks that DMI is available (we got it in the previous function - dmi_scan_machine) and collects information about memory devices with dmi_walk_early and dmi_alloc which are defined as:

#ifdef CONFIG_DMI
RESERVE_BRK(dmi_alloc, 65536);
#endif

RESERVE_BRK is defined in the arch/x86/include/asm/setup.h and reserves space of given size in the brk section.


init_hypervisor_platform();
x86_init.resources.probe_roms();
insert_resource(&iomem_resource, &code_resource);
insert_resource(&iomem_resource, &data_resource);
insert_resource(&iomem_resource, &bss_resource);
early_gart_iommu_check();

SMP config

The next step is parsing the SMP configuration. We do it with the call to the find_smp_config function which just calls the function:

static inline void find_smp_config(void)
{
        x86_init.mpparse.find_smp_config();
}

x86_init.mpparse.find_smp_config is the default_find_smp_config function from the arch/x86/kernel/mpparse.c. In the default_find_smp_config function we are scanning a couple of memory regions for SMP config and return if they are found:

if (smp_scan_config(0x0, 0x400) ||
            smp_scan_config(639 * 0x400, 0x400) ||
            smp_scan_config(0xF0000, 0x10000))
            return;

First of all smp_scan_config function defines a couple of variables:

unsigned int *bp = phys_to_virt(base);
struct mpf_intel *mpf;

First is virtual address of the memory region where we will scan the SMP config, second is the pointer to the mpf_intel structure. Let's try to understand what mpf_intel is. All information regarding SMP is stored in the multiprocessor configuration data structure. mpf_intel represents this structure and looks as follows:

struct mpf_intel {
        char signature[4];
        unsigned int physptr;
        unsigned char length;
        unsigned char specification;
        unsigned char checksum;
        unsigned char feature1;
        unsigned char feature2;
        unsigned char feature3;
        unsigned char feature4;
        unsigned char feature5;
};

As we can read in the documentation - one of the main functions of the system BIOS is to construct the MP floating pointer structure and the MP configuration table. The operating system must have access to this information about the multiprocessor configuration and mpf_intel stores the physical address (look at second parameter) of the multiprocessor configuration table. So, smp_scan_config loops through the given memory range and tries to find MP floating pointer structure there. It checks that the current byte points to the SMP signature, checks checksum, checks if mpf->specification is 1 or 4(it must be 1 or 4 according to the specification) in the loop:

while (length > 0) {
if ((*bp == SMP_MAGIC_IDENT) &&
    (mpf->length == 1) &&
    !mpf_checksum((unsigned char *)bp, 16) &&
    ((mpf->specification == 1)
    || (mpf->specification == 4))) {

        mem = virt_to_phys(mpf);
        memblock_reserve(mem, sizeof(*mpf));
        if (mpf->physptr)
            smp_reserve_memory(mpf);
	}
}

reserves given memory block if search is successful with memblock_reserve and reserves physical address of the multiprocessor configuration table. You can find documentation about this in the - MultiProcessor Specification. You can read more details about that in the special part about SMP.

Additional early memory initialization routines

In the next step of the setup_arch we can see the call to the early_alloc_pgt_buf function which allocates the page table buffer for the early stage. The page table buffer will be placed in the brk area. Let's look at its implementation:

void  __init early_alloc_pgt_buf(void)
{
        unsigned long tables = INIT_PGT_BUF_SIZE;
        phys_addr_t base;

        base = __pa(extend_brk(tables, PAGE_SIZE));

        pgt_buf_start = base >> PAGE_SHIFT;
        pgt_buf_end = pgt_buf_start;
        pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
}

First, it gets the size of the page table buffer - INIT_PGT_BUF_SIZE which in the current Linux kernel (4.0) is (6 * PAGE_SIZE). As we got the size of the page table buffer, we call the extend_brk function with two parameters: size and align. As you can understand from its name, this function extends the brk area. As we can see in the Linux kernel linker script brk is in memory right after the BSS:

	. = ALIGN(PAGE_SIZE);
	.brk : AT(ADDR(.brk) - LOAD_OFFSET) {
		__brk_base = .;
		. += 64 * 1024;		/* 64k alignment slop space */
		*(.brk_reservation)	/* areas brk users have reserved */
		__brk_limit = .;
	}

Or we can find it using the readelf util:

brk area

After we got physical address of the new brk with the __pa macro, we calculate the base address and the end of the page table buffer. In the next step as we got page table buffer, we reserve memory block for the brk area with the reserve_brk function:

static void __init reserve_brk(void)
{
	if (_brk_end > _brk_start)
		memblock_reserve(__pa_symbol(_brk_start),
				 _brk_end - _brk_start);

	_brk_start = 0;
}

Note that at the end of the reserve_brk, we set brk_start to zero, because after this we will not allocate it anymore. As the next step after reserving memory block for the brk, we need to unmap out-of-range memory areas in the kernel mapping with the cleanup_highmap function. Remember that kernel mapping is either __START_KERNEL_map and _end - _text or level2_kernel_pgt which maps the kernel _text, data and bss segments. At the start of the clean_high_map function we define these parameters:

unsigned long vaddr = __START_KERNEL_map;
unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
pmd_t *pmd = level2_kernel_pgt;
pmd_t *last_pmd = pmd + PTRS_PER_PMD;

Now, as we have defined start and end of the kernel mapping, we go in a loop through all the kernel page middle directory entries and clean entries which are not between _text and end:

for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
        if (pmd_none(*pmd))
            continue;
        if (vaddr < (unsigned long) _text || vaddr > end)
            set_pmd(pmd, __pmd(0));
}

After this we set the limit for the memblock allocation with the memblock_set_current_limit function (you can read more about memblock in the Linux kernel memory management Part 2); it will be either ISA_END_ADDRESS or 0x100000 and will fill the memblock information according to e820 with the call to the memblock_x86_fill function. You can see the result of this function being printed during the kernel initialization time:

MEMBLOCK configuration:
 memory size = 0x1fff7ec00 reserved size = 0x1e30000
 memory.cnt  = 0x3
 memory[0x0]	[0x00000000001000-0x0000000009efff], 0x9e000 bytes flags: 0x0
 memory[0x1]	[0x00000000100000-0x000000bffdffff], 0xbfee0000 bytes flags: 0x0
 memory[0x2]	[0x00000100000000-0x0000023fffffff], 0x140000000 bytes flags: 0x0
 reserved.cnt  = 0x3
 reserved[0x0]	[0x0000000009f000-0x000000000fffff], 0x61000 bytes flags: 0x0
 reserved[0x1]	[0x00000001000000-0x00000001a57fff], 0xa58000 bytes flags: 0x0
 reserved[0x2]	[0x0000007ec89000-0x0000007fffffff], 0x1377000 bytes flags: 0x0

The rest functions after the memblock_x86_fill are: early_reserve_e820_mpc_new that allocates additional slots in the e820map for MultiProcessor Specification table, reserve_real_mode - reserves low memory from 0x0 to 1 megabyte for the trampoline to the real mode (for rebooting, etc.), trim_platform_memory_ranges - trims certain memory regions starting from 0x20050000, 0x20110000, etc. - these regions must be excluded because Sandy Bridge has problems with these regions, trim_low_memory_range reserves the first 4 kilobyte page from the memblock are, init_mem_mapping function reconstructs direct memory mapping and setups the direct mapping of the physical memory at PAGE_OFFSET, early_trap_pf_init setups #PF handler (we will look at it in the chapter about interrupts), and lastly setup_real_mode function setups trampoline to the real mode code.

That's all. You can note that this part did not cover all the functions which are in the setup_arch (like early_gart_iommu_check, mtrr initialization, etc.). As I already wrote many times, setup_arch is big, and Linux kernel is big. That's why I can't cover every line in the Linux kernel. I don't think that we missed something important, but you can say something like: each line of code is important. Yes, it's true, but I omitted them anyway, because I think that it is not realistic to cover full Linux kernel. Anyway, we will often return to ideas we have already seen, and if something is unfamiliar, we will cover it in more detail.

Conclusion

It is the end of the sixth part about Linux kernel initialization process. In this part we continued to dive in the setup_arch function again and it was long, but we are still not finished with it. Yes, setup_arch is big and I hope that next part will be the last part about this function.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to linux-insides.